Template-Type: ReDIF-Article 1.0 Author-Name: Wensheng Zhu Author-X-Name-First: Wensheng Author-X-Name-Last: Zhu Author-Name: Yuan Jiang Author-X-Name-First: Yuan Author-X-Name-Last: Jiang Author-Name: Heping Zhang Author-X-Name-First: Heping Author-X-Name-Last: Zhang Title: Nonparametric Covariate-Adjusted Association Tests Based on the Generalized Kendall's Tau Abstract: Identifying the risk factors for comorbidity is important in psychiatric research. Empirically, studies have shown that testing multiple correlated traits simultaneously is more powerful than testing a single trait at a time in association analysis. Furthermore, for complex diseases, especially mental illnesses and behavioral disorders, the traits are often recorded in different scales, such as dichotomous, ordinal, and quantitative. In the absence of covariates, nonparametric association tests have been developed for multiple complex traits to study comorbidity. However, genetic studies generally contain measurements of some covariates that may affect the relationship between the risk factors of major interest (such as genes) and the outcomes. While it is relatively easy to adjust for these covariates in a parametric model for quantitative traits, it is challenging to adjust for covariates when there are multiple complex traits with possibly different scales. In this article, we propose a nonparametric test for multiple complex traits that can adjust for covariate effects. The test aims to achieve an optimal scheme of adjustment by using a maximum statistic calculated from multiple adjusted test statistics. We derive the asymptotic null distribution of the maximum test statistic and also propose a resampling approach, both of which can be used to assess the significance of our test. Simulations are conducted to compare the Type I error and power of the nonparametric adjusted test to the unadjusted test and other existing adjusted tests. The empirical results suggest that our proposed test increases the power through adjustment for covariates when there exist environmental effects and is more robust to model misspecifications than some existing parametric adjusted tests. We further demonstrate the advantage of our test by analyzing a dataset on genetics of alcoholism. Journal: Journal of the American Statistical Association Pages: 1-11 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.643707 File-URL: http://hdl.handle.net/10.1080/01621459.2011.643707 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:1-11 Template-Type: ReDIF-Article 1.0 Author-Name: Xiaoyan Shi Author-X-Name-First: Xiaoyan Author-X-Name-Last: Shi Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Author-Name: Joseph G. Ibrahim Author-X-Name-First: Joseph G. Author-X-Name-Last: Ibrahim Author-Name: Faming Liang Author-X-Name-First: Faming Author-X-Name-Last: Liang Author-Name: Jeffrey Lieberman Author-X-Name-First: Jeffrey Author-X-Name-Last: Lieberman Author-Name: Martin Styner Author-X-Name-First: Martin Author-X-Name-Last: Styner Title: Intrinsic Regression Models for Medial Representation of Subcortical Structures Abstract: The aim of this article is to develop a semiparametric model to describe the variability of the medial representation of subcortical structures, which belongs to a Riemannian manifold, and establish its association with covariates of interest, such as diagnostic status, age, and gender. 
We develop a two-stage estimation procedure to calculate the parameter estimates. The first stage is to calculate an intrinsic least squares estimator of the parameter vector using the annealing evolutionary stochastic approximation Monte Carlo algorithm, and then the second stage is to construct a set of estimating equations to obtain a more efficient estimate with the intrinsic least squares estimate as the starting point. We use Wald statistics to test linear hypotheses of unknown parameters and establish their limiting distributions. Simulation studies are used to evaluate the accuracy of our parameter estimates and the finite sample performance of the Wald statistics. We apply our methods to the detection of the difference in the morphological changes of the left and right hippocampi between schizophrenia patients and healthy controls using a medial shape description. This article has online supplementary material. Journal: Journal of the American Statistical Association Pages: 12-23 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.643710 File-URL: http://hdl.handle.net/10.1080/01621459.2011.643710 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:12-23 Template-Type: ReDIF-Article 1.0 Author-Name: Debbie J. Dupuis Author-X-Name-First: Debbie J. Author-X-Name-Last: Dupuis Title: Modeling Waves of Extreme Temperature: The Changing Tails of Four Cities Abstract: Heat waves are a serious threat to society, the environment, and the economy. Estimates of the recurrence probabilities of heat waves may be obtained following the successful modeling of daily maximum temperature, but working with the latter is difficult as we have to recognize, and allow for, not only a time trend but also seasonality in the mean and in the variability, as well as serial correlation. Furthermore, as the extreme values of daily maximum temperature have a different form of nonstationarity from the body, additional modeling is required to completely capture the realities. We present a time series model for the daily maximum temperature and use an exceedance over high thresholds approach to model the upper tail of the distribution of its scaled residuals. We show how a change-point analysis can be used to identify seasons of constant crossing rates and how a time-dependent shape parameter can then be introduced to capture a change in the distribution of the exceedances. Daily maximum temperature series for Des Moines, New York, Portland, and Tucson are analyzed. In-sample and out-of-sample goodness-of-fit measures show that the proposed model is an excellent fit to the data. The fitted model is then used to estimate the recurrence probabilities of runs over seasonally high temperatures, and we show that the probability of long and intense heat waves has increased considerably over 50 years. We also find that the increases vary by city and by time of year. Journal: Journal of the American Statistical Association Pages: 24-39 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.643732 File-URL: http://hdl.handle.net/10.1080/01621459.2011.643732 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:24-39 Template-Type: ReDIF-Article 1.0 Author-Name: Lawrence C. McCandless Author-X-Name-First: Lawrence C. 
Author-X-Name-Last: McCandless Author-Name: Sylvia Richardson Author-X-Name-First: Sylvia Author-X-Name-Last: Richardson Author-Name: Nicky Best Author-X-Name-First: Nicky Author-X-Name-Last: Best Title: Adjustment for Missing Confounders Using External Validation Data and Propensity Scores Abstract: Reducing bias from missing confounders is a challenging problem in the analysis of observational data. Information about missing variables is sometimes available from external validation data, such as surveys or secondary samples drawn from the same source population. In principle, the validation data permit us to recover information about the missing data, but the difficulty is in eliciting a valid model for the nuisance distribution of the missing confounders. Motivated by a British study of the effects of trihalomethane exposure on risk of full-term low birthweight, we describe a flexible Bayesian procedure for adjusting for a vector of missing confounders using external validation data. We summarize the missing confounders with a scalar summary score using the propensity score methodology of Rosenbaum and Rubin. The score has the property that it induces conditional independence between the exposure and the missing confounders, given the measured confounders. It balances the unmeasured confounders across exposure groups, within levels of measured covariates. To adjust for bias, we need only model and adjust for the summary score during Markov chain Monte Carlo computation. Simulation results illustrate that the proposed method reduces bias from several missing confounders over a range of different sample sizes for the validation data. Appendices A--C are available as online supplementary material. Journal: Journal of the American Statistical Association Pages: 40-51 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.643739 File-URL: http://hdl.handle.net/10.1080/01621459.2011.643739 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:40-51 Template-Type: ReDIF-Article 1.0 Author-Name: James Y. Dai Author-X-Name-First: James Y. Author-X-Name-Last: Dai Author-Name: Peter B. Gilbert Author-X-Name-First: Peter B. Author-X-Name-Last: Gilbert Author-Name: Benoît R. Mâsse Author-X-Name-First: Benoît R. Author-X-Name-Last: Mâsse Title: Partially Hidden Markov Model for Time-Varying Principal Stratification in HIV Prevention Trials Abstract: It is frequently of interest to estimate the intervention effect that adjusts for post-randomization variables in clinical trials. In the recently completed HPTN 035 trial, there is differential condom use between the three microbicide gel arms and the no-gel control arm, so intention-to-treat (ITT) analyses only assess the net treatment effect that includes the indirect treatment effect mediated through differential condom use. Various statistical methods in causal inference have been developed to adjust for post-randomization variables. We extend the principal stratification framework to time-varying behavioral variables in HIV prevention trials with a time-to-event endpoint, using a partially hidden Markov model (pHMM). We formulate the causal estimand of interest, establish assumptions that enable identifiability of the causal parameters, and develop maximum likelihood methods for estimation. Application of our model to the HPTN 035 trial reveals an interesting pattern of prevention effectiveness among different condom-use principal strata. 
Journal: Journal of the American Statistical Association Pages: 52-65 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.643743 File-URL: http://hdl.handle.net/10.1080/01621459.2011.643743 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:52-65 Template-Type: ReDIF-Article 1.0 Author-Name: Jooyoung Jeon Author-X-Name-First: Jooyoung Author-X-Name-Last: Jeon Author-Name: James W. Taylor Author-X-Name-First: James W. Author-X-Name-Last: Taylor Title: Using Conditional Kernel Density Estimation for Wind Power Density Forecasting Abstract: Of the various renewable energy resources, wind power is widely recognized as one of the most promising. The management of wind farms and electricity systems can benefit greatly from the availability of estimates of the probability distribution of wind power generation. However, most research has focused on point forecasting of wind power. In this article, we develop an approach to producing density forecasts for the wind power generated at individual wind farms. Our interest is in intraday data and prediction from 1 to 72 hours ahead. We model wind power in terms of wind speed and wind direction. In this framework, there are two key uncertainties. First, there is the inherent uncertainty in wind speed and direction, and we model this using a bivariate vector autoregressive moving average-generalized autoregressive conditional heteroscedastic (VARMA-GARCH) model, with a Student t error distribution, in the Cartesian space of wind speed and direction. Second, there is the stochastic nature of the relationship of wind power to wind speed (described by the power curve), and to wind direction. We model this using conditional kernel density (CKD) estimation, which enables a nonparametric modeling of the conditional density of wind power. Using Monte Carlo simulation of the VARMA-GARCH model and CKD estimation, density forecasts of wind speed and direction are converted to wind power density forecasts. Our work is novel in several respects: previous wind power studies have not modeled a stochastic power curve; to accommodate time evolution in the power curve, we incorporate a time decay factor within the CKD method; and the CKD method is conditional on a density, rather than a single value. The new approach is evaluated using datasets from four Greek wind farms. Journal: Journal of the American Statistical Association Pages: 66-79 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.643745 File-URL: http://hdl.handle.net/10.1080/01621459.2011.643745 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:66-79 Template-Type: ReDIF-Article 1.0 Author-Name: Tristan Zajonc Author-X-Name-First: Tristan Author-X-Name-Last: Zajonc Title: Bayesian Inference for Dynamic Treatment Regimes: Mobility, Equity, and Efficiency in Student Tracking Abstract: Policies in health, education, and economics often unfold sequentially and adapt to changing conditions. Such time-varying treatments pose problems for standard program evaluation methods because intermediate outcomes are simultaneously pretreatment confounders and posttreatment outcomes. This article extends the Bayesian perspective on causal inference and optimal treatment to these types of dynamic treatment regimes. 
A unifying idea remains ignorable treatment assignment, which now sequentially includes selection on intermediate outcomes. I present methods to estimate the causal effect of arbitrary regimes, recover the optimal regime, and characterize the set of feasible outcomes under different regimes. I demonstrate these methods through an application to optimal student tracking in ninth and tenth grade mathematics. For the sample considered, student mobility under the status-quo regime is significantly below the optimal rate and existing policies reinforce between-student inequality. An easy-to-implement optimal dynamic tracking regime, which promotes more students to honors in tenth grade, increases average final achievement to 0.07 standard deviations above the status quo while lowering inequality; there is no binding equity-efficiency tradeoff. The proposed methods provide a flexible and principled approach to causal inference for time-varying treatments and optimal treatment choice under uncertainty. This article has online supplementary material. Journal: Journal of the American Statistical Association Pages: 80-92 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.643747 File-URL: http://hdl.handle.net/10.1080/01621459.2011.643747 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:80-92 Template-Type: ReDIF-Article 1.0 Author-Name: Alexandre Rodrigues Author-X-Name-First: Alexandre Author-X-Name-Last: Rodrigues Author-Name: Peter J. Diggle Author-X-Name-First: Peter J. Author-X-Name-Last: Diggle Title: Bayesian Estimation and Prediction for Inhomogeneous Spatiotemporal Log-Gaussian Cox Processes Using Low-Rank Models, With Application to Criminal Surveillance Abstract: In this article, we propose a method for conducting likelihood-based inference for a class of nonstationary spatiotemporal log-Gaussian Cox processes. The method uses convolution-based models to capture spatiotemporal correlation structure, is computationally feasible even for large datasets, and does not require knowledge of the underlying spatial intensity of the process. We describe an application to a surveillance system for detecting emergent spatiotemporal clusters of homicides in Belo Horizonte, Brazil, and discuss the advantages and drawbacks of our model-based approach by comparison with other spatiotemporal surveillance methods that have been proposed in the literature. Journal: Journal of the American Statistical Association Pages: 93-101 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.644496 File-URL: http://hdl.handle.net/10.1080/01621459.2011.644496 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:93-101 Template-Type: ReDIF-Article 1.0 Author-Name: Keyur H. Desai Author-X-Name-First: Keyur H. Author-X-Name-Last: Desai Author-Name: John D. Storey Author-X-Name-First: John D. Author-X-Name-Last: Storey Title: Cross-Dimensional Inference of Dependent High-Dimensional Data Abstract: A growing number of modern scientific problems in areas such as genomics, neurobiology, and spatial epidemiology involve the measurement and analysis of thousands of related features that may be stochastically dependent at arbitrarily strong levels. In this work, we consider the scenario where the features follow a multivariate Normal distribution. 
We demonstrate that dependence is manifested as random variation shared among features, and that standard methods may yield highly unstable inference due to dependence, even when the dependence is fully parameterized and utilized in the procedure. We propose a “cross-dimensional inference” framework that alleviates the problems due to dependence by modeling and removing the variation shared among features, while also properly regularizing estimation across features. We demonstrate the framework on both simultaneous point estimation and multiple hypothesis testing in scenarios derived from the scientific applications of interest. Journal: Journal of the American Statistical Association Pages: 135-151 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.645777 File-URL: http://hdl.handle.net/10.1080/01621459.2011.645777 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:135-151 Template-Type: ReDIF-Article 1.0 Author-Name: Bing Li Author-X-Name-First: Bing Author-X-Name-Last: Li Author-Name: Hyonho Chun Author-X-Name-First: Hyonho Author-X-Name-Last: Chun Author-Name: Hongyu Zhao Author-X-Name-First: Hongyu Author-X-Name-Last: Zhao Title: Sparse Estimation of Conditional Graphical Models With Application to Gene Networks Abstract: In many applications the graph structure in a network arises from two sources: intrinsic connections and connections due to external effects. We introduce a sparse estimation procedure for graphical models that is capable of isolating the intrinsic connections by removing the external effects. Technically, this is formulated as a conditional graphical model, in which the external effects are modeled as predictors, and the graph is determined by the conditional precision matrix. We introduce two sparse estimators of this matrix using the reproducing kernel Hilbert space combined with lasso and adaptive lasso. We establish the sparsity, variable selection consistency, oracle property, and the asymptotic distributions of the proposed estimators. We also derive their convergence rates when the dimension of the conditional precision matrix goes to infinity. The methods are compared with sparse estimators for unconditional graphical models, and with the constrained maximum likelihood estimate that assumes a known graph structure. The methods are applied to a genetic data set to construct a gene network conditioning on single-nucleotide polymorphisms. Journal: Journal of the American Statistical Association Pages: 152-167 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.644498 File-URL: http://hdl.handle.net/10.1080/01621459.2011.644498 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:152-167 Template-Type: ReDIF-Article 1.0 Author-Name: Yanyuan Ma Author-X-Name-First: Yanyuan Author-X-Name-Last: Ma Author-Name: Liping Zhu Author-X-Name-First: Liping Author-X-Name-Last: Zhu Title: A Semiparametric Approach to Dimension Reduction Abstract: We provide a novel approach to dimension-reduction problems that is completely different from those in the existing literature. We cast the dimension-reduction problem in a semiparametric estimation framework and derive estimating equations. Viewing this problem from the new angle allows us to derive a rich class of estimators, and obtain the classical dimension reduction techniques as special cases in this class. 
The semiparametric approach also reveals that, in the inverse regression context, the common assumption of linearity and/or constant variance on the covariates can be removed at the cost of performing additional nonparametric regression, while keeping the estimation structure intact. The semiparametric estimators without these common assumptions are illustrated through simulation studies and a real data example. This article has online supplementary material. Journal: Journal of the American Statistical Association Pages: 168-179 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.646925 File-URL: http://hdl.handle.net/10.1080/01621459.2011.646925 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:168-179 Template-Type: ReDIF-Article 1.0 Author-Name: Tatiyana V. Apanasovich Author-X-Name-First: Tatiyana V. Author-X-Name-Last: Apanasovich Author-Name: Marc G. Genton Author-X-Name-First: Marc G. Author-X-Name-Last: Genton Author-Name: Ying Sun Author-X-Name-First: Ying Author-X-Name-Last: Sun Title: A Valid Matérn Class of Cross-Covariance Functions for Multivariate Random Fields With Any Number of Components Abstract: We introduce a valid parametric family of cross-covariance functions for multivariate spatial random fields where each component has a covariance function from a well-celebrated Matérn class. Unlike previous attempts, our model indeed allows for various smoothnesses and rates of correlation decay for any number of vector components. We present the conditions on the parameter space that result in valid models with varying degrees of complexity. We discuss practical implementations, including reparameterizations to reflect the conditions on the parameter space and an iterative algorithm to increase the computational efficiency. We perform various Monte Carlo simulation experiments to explore the performances of our approach in terms of estimation and cokriging. The application of the proposed multivariate Matérn model is illustrated on two meteorological datasets: temperature/pressure over the Pacific Northwest (bivariate) and wind/temperature/pressure in Oklahoma (trivariate). In the latter case, our flexible trivariate Matérn model is valid and yields better predictive scores compared with a parsimonious model with common scale parameters. Journal: Journal of the American Statistical Association Pages: 180-193 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.643197 File-URL: http://hdl.handle.net/10.1080/01621459.2011.643197 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:180-193 Template-Type: ReDIF-Article 1.0 Author-Name: Huixia Judy Wang Author-X-Name-First: Huixia Judy Author-X-Name-Last: Wang Author-Name: Xingdong Feng Author-X-Name-First: Xingdong Author-X-Name-Last: Feng Title: Multiple Imputation for M-Regression With Censored Covariates Abstract: We develop a new multiple imputation approach for M-regression models with censored covariates. Instead of specifying parametric likelihoods, our method imputes the censored covariates by their conditional quantiles given the observed data, where the conditional quantiles are estimated through fitting a censored quantile regression process. 
The resulting estimator is shown to be consistent and asymptotically normal, and it improves the estimation efficiency by using information from cases with censored covariates. Compared with existing methods, the proposed method is more flexible as it does not require stringent parametric assumptions on the distributions of either the regression errors or the covariates. The finite sample performance of the proposed method is assessed through a simulation study and the analysis of a C-reactive protein dataset in the 2007--2008 National Health and Nutrition Examination Survey. This article has supplementary material online. Journal: Journal of the American Statistical Association Pages: 194-204 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.643198 File-URL: http://hdl.handle.net/10.1080/01621459.2011.643198 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:194-204 Template-Type: ReDIF-Article 1.0 Author-Name: Qian M. Zhou Author-X-Name-First: Qian M. Author-X-Name-Last: Zhou Author-Name: Peter X.-K. Song Author-X-Name-First: Peter X.-K. Author-X-Name-Last: Song Author-Name: Mary E. Thompson Author-X-Name-First: Mary E. Author-X-Name-Last: Thompson Title: Information Ratio Test for Model Misspecification in Quasi-Likelihood Inference Abstract: In this article, we focus on the circumstances in quasi-likelihood inference in which the estimation accuracy of mean structure parameters is guaranteed by correct specification of the first moment, but the estimation efficiency could be diminished due to misspecification of the second moment. We propose an information ratio (IR) statistic to test for model misspecification of the variance/covariance structure through a comparison between two forms of the information matrix: the negative sensitivity matrix and the variability matrix. We establish asymptotic distributions of the proposed IR test statistics. We also suggest an approximation to the asymptotic distribution of the IR statistic via a perturbation resampling method. Moreover, we propose a selection criterion based on the IR test to select the best fitting variance/covariance structure from a class of candidates. Through simulation studies, it is shown that the IR statistic provides a powerful statistical tool to detect different scenarios of misspecification of the variance/covariance structures. In addition, the IR test as well as the proposed model selection procedure shows substantial improvement over some of the existing statistical methods. The IR-based model selection procedure is illustrated by analyzing the Madras Longitudinal Schizophrenia data. Appendices are included in the supplemental materials, which are available online. Journal: Journal of the American Statistical Association Pages: 205-213 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.645785 File-URL: http://hdl.handle.net/10.1080/01621459.2011.645785 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:205-213 Template-Type: ReDIF-Article 1.0 Author-Name: Lan Wang Author-X-Name-First: Lan Author-X-Name-Last: Wang Author-Name: Yichao Wu Author-X-Name-First: Yichao Author-X-Name-Last: Wu Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Title: Quantile Regression for Analyzing Heterogeneity in Ultra-High Dimension Abstract: Ultra-high dimensional data often display heterogeneity due to either heteroscedastic variance or other forms of non-location-scale covariate effects. To accommodate heterogeneity, we advocate a more general interpretation of sparsity, which assumes that only a small number of covariates influence the conditional distribution of the response variable, given all candidate covariates; however, the sets of relevant covariates may differ when we consider different segments of the conditional distribution. In this framework, we investigate the methodology and theory of nonconvex, penalized quantile regression in ultra-high dimension. The proposed approach has two distinctive features: (1) It enables us to explore the entire conditional distribution of the response variable, given the ultra-high-dimensional covariates, and provides a more realistic picture of the sparsity pattern; (2) it requires substantially weaker conditions compared with alternative methods in the literature; thus, it greatly alleviates the difficulty of model checking in the ultra-high dimension. In the theoretical development, it is challenging to deal with both the nonsmooth loss function and the nonconvex penalty function in an ultra-high-dimensional parameter space. We introduce a novel, sufficient optimality condition that relies on a convex differencing representation of the penalized loss function and the subdifferential calculus. Exploring this optimality condition enables us to establish the oracle property for sparse quantile regression in the ultra-high dimension under relaxed conditions. The proposed method greatly enhances existing tools for ultra-high-dimensional data analysis. Monte Carlo simulations demonstrate the usefulness of the proposed procedure. The real data example we analyzed demonstrates that the new approach reveals substantially more information as compared with alternative methods. This article has online supplementary material. Journal: Journal of the American Statistical Association Pages: 214-222 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2012.656014 File-URL: http://hdl.handle.net/10.1080/01621459.2012.656014 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:214-222 Template-Type: ReDIF-Article 1.0 Author-Name: Xiaotong Shen Author-X-Name-First: Xiaotong Author-X-Name-Last: Shen Author-Name: Wei Pan Author-X-Name-First: Wei Author-X-Name-Last: Pan Author-Name: Yunzhang Zhu Author-X-Name-First: Yunzhang Author-X-Name-Last: Zhu Title: Likelihood-Based Selection and Sharp Parameter Estimation Abstract: In high-dimensional data analysis, feature selection becomes one effective means for dimension reduction, which proceeds with parameter estimation. Concerning accuracy of selection and estimation, we study nonconvex constrained and regularized likelihoods in the presence of nuisance parameters. 
Theoretically, we show that the constrained L0 likelihood and its computational surrogate are optimal in that they achieve feature selection consistency and sharp parameter estimation, under one necessary condition required for any method to be selection consistent and to achieve sharp parameter estimation. It permits up to exponentially many candidate features. Computationally, we develop difference convex methods to implement the computational surrogate through primal and dual subproblems. These results establish a central role of L0 constrained and regularized likelihoods in feature selection and parameter estimation involving selection. As applications of the general method and theory, we perform feature selection in linear regression and logistic regression, and estimate a precision matrix in Gaussian graphical models. In these situations, we gain a new theoretical insight and obtain favorable numerical results. Finally, we discuss an application to predict the metastasis status of breast cancer patients with their gene expression profiles. This article has online supplementary material. Journal: Journal of the American Statistical Association Pages: 223-232 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.645783 File-URL: http://hdl.handle.net/10.1080/01621459.2011.645783 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:223-232 Template-Type: ReDIF-Article 1.0 Author-Name: Daniel J. Nordman Author-X-Name-First: Daniel J. Author-X-Name-Last: Nordman Author-Name: Soumendra N. Lahiri Author-X-Name-First: Soumendra N. Author-X-Name-Last: Lahiri Title: Block Bootstraps for Time Series With Fixed Regressors Abstract: This article examines block bootstrap methods in linear regression models with weakly dependent error variables and nonstochastic regressors. Contrary to intuition, the tapered block bootstrap (TBB) with a smooth taper not only loses its superior bias properties but may also fail to be consistent in the regression problem. A similar problem, albeit at a smaller scale, is shown to exist for the moving and the circular block bootstrap (MBB and CBB, respectively). As a remedy, an additional block randomization step is introduced that balances out the effects of nonuniform regression weights, and restores the superiority of the (modified) TBB. The randomization step also improves the MBB or CBB. Interestingly, the stationary bootstrap (SB) automatically balances out regression weights through its probabilistic blocking mechanism, without requiring any modification, and enjoys a kind of robustness. Optimal block sizes are explicitly determined for block bootstrap variance estimators under regression. Finite sample performance and practical uses of the methods are illustrated through a simulation study and two data examples, respectively. Supplementary materials are available online. Journal: Journal of the American Statistical Association Pages: 233-246 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.646929 File-URL: http://hdl.handle.net/10.1080/01621459.2011.646929 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:233-246 Template-Type: ReDIF-Article 1.0 Author-Name: Zonghui Hu Author-X-Name-First: Zonghui Author-X-Name-Last: Hu Author-Name: Dean A. Follmann Author-X-Name-First: Dean A. 
Author-X-Name-Last: Follmann Author-Name: Jing Qin Author-X-Name-First: Jing Author-X-Name-Last: Qin Title: Semiparametric Double Balancing Score Estimation for Incomplete Data With Ignorable Missingness Abstract: When estimating the marginal mean response with missing observations, a critical issue is robustness to model misspecification. In this article, we propose a semiparametric estimation method with extended double robustness that attains the optimal efficiency under less stringent requirements for model specification than the doubly robust estimators. In this semiparametric estimation, covariate information is collapsed into a two-dimensional score S, with one dimension for (i) the pattern of missingness and the other for (ii) the pattern of response, both estimated from some working parametric models. The mean response E(Y) is then estimated by the sample mean of E(Y|S), which is estimated via nonparametric regression. The semiparametric estimator is consistent if either the “core” of (i) or the “core” of (ii) is captured by S, and attains the optimal efficiency if both are captured by S. As the “cores” can be obtained without correctly specifying the full parametric models for (i) or (ii), the proposed estimator can be more robust than other doubly robust estimators. As S contains the propensity score as one component, the proposed estimator avoids the use and the shortcomings of inverse propensity weighting. This semiparametric estimator is most appealing for high-dimensional covariates, where fully correct model specification is challenging and nonparametric estimation is not feasible due to the problem of dimensionality. Numerical performance is investigated by simulation studies. Journal: Journal of the American Statistical Association Pages: 247-257 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2012.656009 File-URL: http://hdl.handle.net/10.1080/01621459.2012.656009 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:247-257 Template-Type: ReDIF-Article 1.0 Author-Name: Gaurav Sharma Author-X-Name-First: Gaurav Author-X-Name-Last: Sharma Author-Name: Thomas Mathew Author-X-Name-First: Thomas Author-X-Name-Last: Mathew Title: One-Sided and Two-Sided Tolerance Intervals in General Mixed and Random Effects Models Using Small-Sample Asymptotics Abstract: The computation of tolerance intervals in mixed and random effects models has not been satisfactorily addressed in a general setting when the data are unbalanced and/or when covariates are present. This article derives satisfactory one-sided and two-sided tolerance intervals in such a general scenario, by applying small-sample asymptotic procedures. In the case of one-sided tolerance limits, the problem reduces to the interval estimation of a percentile, and accurate confidence limits are derived using small-sample asymptotics. In the case of a two-sided tolerance interval, the problem does not reduce to an interval estimation problem; however, it is possible to derive an approximate margin of error statistic that is an upper confidence limit for a linear combination of the variance components. For the latter problem, small-sample asymptotic procedures can once again be used to arrive at an accurate upper confidence limit. In the article, balanced and unbalanced data situations are treated separately, and computational issues are addressed in detail. 
Extensive numerical results show that the tolerance intervals derived based on small-sample asymptotics exhibit satisfactory performance regardless of the sample size. The results are illustrated using some examples. Some technical derivations, additional simulation results, and R codes are available online as supplementary materials. Journal: Journal of the American Statistical Association Pages: 258-267 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.640592 File-URL: http://hdl.handle.net/10.1080/01621459.2011.640592 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:258-267 Template-Type: ReDIF-Article 1.0 Author-Name: Moreno Bevilacqua Author-X-Name-First: Moreno Author-X-Name-Last: Bevilacqua Author-Name: Carlo Gaetan Author-X-Name-First: Carlo Author-X-Name-Last: Gaetan Author-Name: Jorge Mateu Author-X-Name-First: Jorge Author-X-Name-Last: Mateu Author-Name: Emilio Porcu Author-X-Name-First: Emilio Author-X-Name-Last: Porcu Title: Estimating Space and Space-Time Covariance Functions for Large Data Sets: A Weighted Composite Likelihood Approach Abstract: In this article, we propose two methods for estimating space and space-time covariance functions from a Gaussian random field, based on the composite likelihood idea. The first method relies on the maximization of a weighted version of the composite likelihood function, while the second one is based on the solution of a weighted composite score equation. This last scheme is quite general and could be applied to any kind of composite likelihood. An information criterion for model selection based on the first estimation method is also introduced. The methods are useful for practitioners looking for a good balance between computational complexity and statistical efficiency. The effectiveness of the methods is illustrated through examples, simulation experiments, and by analyzing a dataset on ozone measurements. Journal: Journal of the American Statistical Association Pages: 268-280 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.646928 File-URL: http://hdl.handle.net/10.1080/01621459.2011.646928 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:268-280 Template-Type: ReDIF-Article 1.0 Author-Name: Luke Bornn Author-X-Name-First: Luke Author-X-Name-Last: Bornn Author-Name: Gavin Shaddick Author-X-Name-First: Gavin Author-X-Name-Last: Shaddick Author-Name: James V. Zidek Author-X-Name-First: James V. Author-X-Name-Last: Zidek Title: Modeling Nonstationary Processes Through Dimension Expansion Abstract: In this article, we propose a novel approach to modeling nonstationary spatial fields. The proposed method works by expanding the geographic plane over which these processes evolve into higher-dimensional spaces, transforming and clarifying complex patterns in the physical plane. Combining aspects of multidimensional scaling, the group lasso, and latent variable models, the method finds a dimensionally sparse projection in which the originally nonstationary field exhibits stationarity. Following a comparison with existing methods in a simulated environment, dimension expansion is studied on a classic test-bed dataset historically used to study nonstationary models. 
Following this, we explore the use of dimension expansion in modeling air pollution in the United Kingdom, a process known to be strongly influenced by rural/urban effects, amongst others, which gives rise to a nonstationary field. Journal: Journal of the American Statistical Association Pages: 281-289 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.646919 File-URL: http://hdl.handle.net/10.1080/01621459.2011.646919 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:281-289 Template-Type: ReDIF-Article 1.0 Author-Name: Michael S. Smith Author-X-Name-First: Michael S. Author-X-Name-Last: Smith Author-Name: Mohamad A. Khaled Author-X-Name-First: Mohamad A. Author-X-Name-Last: Khaled Title: Estimation of Copula Models With Discrete Margins via Bayesian Data Augmentation Abstract: Estimation of copula models with discrete margins can be difficult beyond the bivariate case. We show how this can be achieved by augmenting the likelihood with continuous latent variables, and computing inference using the resulting augmented posterior. To evaluate this, we propose two efficient Markov chain Monte Carlo sampling schemes. One generates the latent variables as a block using a Metropolis--Hastings step with a proposal that is close to its target distribution; the other generates them one at a time. Our method applies to all parametric copulas where the conditional copula functions can be evaluated, not just elliptical copulas as in much previous work. Moreover, the copula parameters can be estimated jointly with any marginal parameters, and Bayesian selection ideas can be employed. We establish the effectiveness of the estimation method by modeling consumer behavior in online retail using Archimedean and Gaussian copulas. The example shows that elliptical copulas can be poor at modeling dependence in discrete data, just as they can be in the continuous case. To demonstrate the potential in higher dimensions, we estimate 16-dimensional D-vine copulas for a longitudinal model of usage of a bicycle path in the city of Melbourne, Australia. The estimates reveal an interesting serial dependence structure that can be represented in a parsimonious fashion using Bayesian selection of independence pair-copula components. Finally, we extend our results and method to the case where some margins are discrete and others continuous. Supplemental materials for the article are also available online. Journal: Journal of the American Statistical Association Pages: 290-303 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.644501 File-URL: http://hdl.handle.net/10.1080/01621459.2011.644501 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:290-303 Template-Type: ReDIF-Article 1.0 Author-Name: Yulia V. Marchenko Author-X-Name-First: Yulia V. Author-X-Name-Last: Marchenko Author-Name: Marc G. Genton Author-X-Name-First: Marc G. Author-X-Name-Last: Genton Title: A Heckman Selection-t Model Abstract: Sample selection arises often in practice as a result of the partial observability of the outcome of interest in a study. In the presence of sample selection, the observed data do not represent a random sample from the population, even after controlling for explanatory variables. That is, data are missing not at random. Thus, standard analysis using only complete cases will lead to biased results. 
Heckman introduced a sample selection model to analyze such data and proposed a full maximum likelihood estimation method under the assumption of normality. The method was criticized in the literature because of its sensitivity to the normality assumption. In practice, data, such as income or expenditure data, often violate the normality assumption because of heavier tails. We first establish a new link between sample selection models and recently studied families of extended skew-elliptical distributions. This link allows us to introduce a selection-t (SLt) model, which models the error distribution using a Student's t distribution. We study its properties and investigate the finite-sample performance of the maximum likelihood estimators for this model. We compare the performance of the SLt model to the conventional Heckman selection-normal (SLN) model and apply it to analyze ambulatory expenditures. Unlike the SLN model, our analysis using the SLt model provides statistical evidence for the existence of sample selection bias in these data. We also investigate the performance of the test for sample selection bias based on the SLt model and compare it with the performances of several tests used with the SLN model. Our findings indicate that the latter tests can be misleading in the presence of heavy-tailed data. Journal: Journal of the American Statistical Association Pages: 304-317 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2012.656011 File-URL: http://hdl.handle.net/10.1080/01621459.2012.656011 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:304-317 Template-Type: ReDIF-Article 1.0 Author-Name: Ying Qing Chen Author-X-Name-First: Ying Qing Author-X-Name-Last: Chen Author-Name: Nan Hu Author-X-Name-First: Nan Author-X-Name-Last: Hu Author-Name: Su-Chun Cheng Author-X-Name-First: Su-Chun Author-X-Name-Last: Cheng Author-Name: Philippa Musoke Author-X-Name-First: Philippa Author-X-Name-Last: Musoke Author-Name: Lue Ping Zhao Author-X-Name-First: Lue Ping Author-X-Name-Last: Zhao Title: Estimating Regression Parameters in an Extended Proportional Odds Model Abstract: The proportional odds model may serve as a useful alternative to the Cox proportional hazards model to study association between covariates and their survival functions in medical studies. In this article, we study an extended proportional odds model that incorporates the so-called “external” time-varying covariates. In the extended model, regression parameters have a direct interpretation of comparing survival functions, without specifying the baseline survival odds function. Semiparametric and maximum likelihood estimation procedures are proposed to estimate the extended model. Our methods are demonstrated by Monte Carlo simulations, and applied to a landmark randomized clinical trial of short-course nevirapine (NVP) for preventing mother-to-child transmission (MTCT) of human immunodeficiency virus type-1 (HIV-1). An additional application is an analysis of the well-known Veterans Administration (VA) lung cancer trial. Journal: Journal of the American Statistical Association Pages: 318-330 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2012.656021 File-URL: http://hdl.handle.net/10.1080/01621459.2012.656021 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:318-330 Template-Type: ReDIF-Article 1.0 Author-Name: Ruoqing Zhu Author-X-Name-First: Ruoqing Author-X-Name-Last: Zhu Author-Name: Michael R. Kosorok Author-X-Name-First: Michael R. Author-X-Name-Last: Kosorok Title: Recursively Imputed Survival Trees Abstract: We propose recursively imputed survival tree (RIST) regression for right-censored data. This new nonparametric regression procedure uses a novel recursive imputation approach combined with extremely randomized trees that allows significantly better use of censored data than previous tree-based methods, yielding improved model fit and reduced prediction error. The proposed method can also be viewed as a type of Monte Carlo EM algorithm, which generates extra diversity in the tree-based fitting process. Simulation studies and data analyses demonstrate the superior performance of RIST compared with previous methods. Journal: Journal of the American Statistical Association Pages: 331-340 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.637468 File-URL: http://hdl.handle.net/10.1080/01621459.2011.637468 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:331-340 Template-Type: ReDIF-Article 1.0 Author-Name: Sebastian Irle Author-X-Name-First: Sebastian Author-X-Name-Last: Irle Author-Name: Helmut Schäfer Author-X-Name-First: Helmut Author-X-Name-Last: Schäfer Title: Interim Design Modifications in Time-to-Event Studies Abstract: We propose a flexible method for interim design modifications in time-to-event studies. With this method, it is possible to inspect the data at any time during the course of the study, without the need for prespecification of a learning phase, and to make certain types of design modifications depending on the interim data without compromising the Type I error risk. The method can be applied to studies designed with a conventional statistical test, fixed sample, or group sequential, even when no adaptive interim analysis and no specific method for design adaptations (such as combination tests) had been foreseen in the protocol. Currently, the method supports design changes such as an extension of the recruitment or follow-up period, certain modifications of the number and schedule of interim analyses, and changes of inclusion criteria. In contrast to existing methods offering the same flexibility, our approach allows us to make use of the full interim information collected until the time of the adaptive data inspection. This includes time-to-event data from patients who have already experienced an event at the time of the data inspection, and preliminary information from patients still alive, even if this information is predictive for survival, such as early treatment response in a cancer clinical trial. Our method is an extension of the so-called conditional rejection probability (CRP) principle. It is based on the conditional distribution of the test statistic given the final value of the same test statistic from a subsample, namely the learning sample. It is developed in detail for the example of the logrank statistic, for which we derive this conditional distribution using martingale techniques. 
Journal: Journal of the American Statistical Association Pages: 341-348 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.644141 File-URL: http://hdl.handle.net/10.1080/01621459.2011.644141 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:341-348 Template-Type: ReDIF-Article 1.0 Author-Name: Bodhisattva Sen Author-X-Name-First: Bodhisattva Author-X-Name-Last: Sen Author-Name: Probal Chaudhuri Author-X-Name-First: Probal Author-X-Name-Last: Chaudhuri Title: On Fractile Transformation of Covariates in Regression Abstract: The need for comparing two regression functions arises frequently in statistical applications. Comparison of the usual regression functions is not very meaningful in situations where the distributions and the ranges of the covariates are different for the populations. For instance, in econometric studies, the prices of commodities and people's incomes observed at different time points may not be on comparable scales due to inflation and other economic factors. In this article, we describe a method of standardizing the covariates and estimating the transformed regression functions, which then become comparable. We develop smooth estimates of the fractile regression function and study its statistical properties analytically as well as numerically. We also provide a few real examples that illustrate the difficulty in comparing the usual regression functions and motivate the need for the fractile transformation. Our analysis of the real examples leads to new and useful statistical conclusions that are missed by comparison of the usual regression functions. Journal: Journal of the American Statistical Association Pages: 349-361 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.646916 File-URL: http://hdl.handle.net/10.1080/01621459.2011.646916 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:349-361 Template-Type: ReDIF-Article 1.0 Author-Name: Anirban Bhattacharya Author-X-Name-First: Anirban Author-X-Name-Last: Bhattacharya Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: Simplex Factor Models for Multivariate Unordered Categorical Data Abstract: Gaussian latent factor models are routinely used for modeling of dependence in continuous, binary, and ordered categorical data. For unordered categorical variables, Gaussian latent factor models lead to challenging computation and complex modeling structures. As an alternative, we propose a novel class of simplex factor models. In the single-factor case, the model treats the different categorical outcomes as independent with unknown marginals. The model can characterize flexible dependence structures parsimoniously with few factors, and as factors are added, any multivariate categorical data distribution can be accurately approximated. Using a Bayesian approach for computation and inferences, a Markov chain Monte Carlo (MCMC) algorithm is proposed that scales well with increasing dimension, with the number of factors treated as unknown. We develop an efficient proposal for updating the base probability vector in hierarchical Dirichlet models. Theoretical properties are described, and we evaluate the approach through simulation examples. Applications are described for modeling dependence in nucleotide sequences and prediction from high-dimensional categorical features. 
Journal: Journal of the American Statistical Association Pages: 362-377 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.646934 File-URL: http://hdl.handle.net/10.1080/01621459.2011.646934 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:362-377 Template-Type: ReDIF-Article 1.0 Author-Name: Ranjan Maitra Author-X-Name-First: Ranjan Author-X-Name-Last: Maitra Author-Name: Volodymyr Melnykov Author-X-Name-First: Volodymyr Author-X-Name-Last: Melnykov Author-Name: Soumendra N. Lahiri Author-X-Name-First: Soumendra N. Author-X-Name-Last: Lahiri Title: Bootstrapping for Significance of Compact Clusters in Multidimensional Datasets Abstract: This article proposes a bootstrap approach for assessing significance in the clustering of multidimensional datasets. The procedure compares two models and declares the more complicated model a better candidate if there is significant evidence in its favor. The performance of the procedure is illustrated on two well-known classification datasets and comprehensively evaluated in terms of its ability to estimate the number of components via extensive simulation studies, with excellent results. The methodology is also applied to the problem of k-means color quantization of several standard images in the literature and is demonstrated to be a viable approach for determining the minimal and optimal numbers of colors needed to display an image without significant loss in resolution. Additional illustrations and performance evaluations are provided in the online supplementary material. Journal: Journal of the American Statistical Association Pages: 378-392 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.646935 File-URL: http://hdl.handle.net/10.1080/01621459.2011.646935 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:378-392 Template-Type: ReDIF-Article 1.0 Author-Name: Peter Z. G. Qian Author-X-Name-First: Peter Z. G. Author-X-Name-Last: Qian Title: Sliced Latin Hypercube Designs Abstract: This article proposes a method for constructing a new type of space-filling design, called a sliced Latin hypercube design, intended for running computer experiments. Such a design is a special Latin hypercube design that can be partitioned into slices of smaller Latin hypercube designs. It is desirable to use the constructed designs for collective evaluations of computer models and ensembles of multiple computer models. The proposed construction method is easy to implement, capable of accommodating any number of factors, and flexible in run size. Examples are given to illustrate the method. Sampling properties of the constructed designs are examined. Numerical illustration is provided to corroborate the derived theoretical results. Journal: Journal of the American Statistical Association Pages: 393-399 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2011.644132 File-URL: http://hdl.handle.net/10.1080/01621459.2011.644132 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:393-399 Template-Type: ReDIF-Article 1.0 Author-Name: Dávid Papp Author-X-Name-First: Dávid Author-X-Name-Last: Papp Title: Optimal Designs for Rational Function Regression Abstract: We consider the problem of finding optimal nonsequential designs for a large class of regression models involving polynomials and rational functions with heteroscedastic noise also given by a polynomial or rational weight function. Since the design weights can be found easily by existing methods once the support is known, we concentrate on determining the support of the optimal design. The proposed method treats D-, E-, A-, and Φp-optimal designs in a unified manner, and generates a polynomial whose zeros are the support points of the optimal approximate design, generalizing a number of previously known results of the same flavor. The method is based on a mathematical optimization model that can incorporate various criteria of optimality and can be solved efficiently by well-established numerical optimization methods. In contrast to optimization-based methods previously proposed for the solution of similar design problems, our method also has a theoretical guarantee of its algorithmic efficiency; in concordance with the theory, the actual running times of all numerical examples considered in the paper are negligible. The numerical stability of the method is demonstrated in an example involving high-degree polynomials. As a corollary, an upper bound on the size of the support set of the minimally supported optimal designs is also found. Journal: Journal of the American Statistical Association Pages: 400-411 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2012.656035 File-URL: http://hdl.handle.net/10.1080/01621459.2012.656035 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:400-411 Template-Type: ReDIF-Article 1.0 Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Yingying Li Author-X-Name-First: Yingying Author-X-Name-Last: Li Author-Name: Ke Yu Author-X-Name-First: Ke Author-X-Name-Last: Yu Title: Vast Volatility Matrix Estimation Using High-Frequency Data for Portfolio Selection Abstract: Portfolio allocation with gross-exposure constraint is an effective method to increase the efficiency and stability of portfolio selection among a vast pool of assets, as demonstrated by Fan, Zhang, and Yu. The required high-dimensional volatility matrix can be estimated by using high-frequency financial data. This enables us to better adapt to the local volatilities and local correlations among a vast number of assets and to increase significantly the sample size for estimating the volatility matrix. This article studies the volatility matrix estimation using high-dimensional, high-frequency data from the perspective of portfolio selection. Specifically, we propose the use of “pairwise-refresh time” and “all-refresh time” methods based on the concept of “refresh time” proposed by Barndorff-Nielsen, Hansen, Lunde, and Shephard for the estimation of vast covariance matrices and compare their merits in portfolio selection. We establish the concentration inequalities of the estimates, which guarantee desirable properties of the estimated volatility matrix in vast asset allocation with gross-exposure constraints. Extensive numerical studies are conducted via carefully designed simulations. 
Compared with methods based on low-frequency daily data, our methods can capture the most recent trends in time-varying volatility and correlation, and hence provide more accurate guidance for portfolio allocation in the next time period. The advantage of using high-frequency data is significant in our simulation and empirical studies, which consist of 50 simulated assets and 30 constituent stocks of the Dow Jones Industrial Average index. Journal: Journal of the American Statistical Association Pages: 412-428 Issue: 497 Volume: 107 Year: 2012 Month: 3 X-DOI: 10.1080/01621459.2012.656041 File-URL: http://hdl.handle.net/10.1080/01621459.2012.656041 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:412-428 Template-Type: ReDIF-Article 1.0 Author-Name: Lane F. Burgette Author-X-Name-First: Lane F. Author-X-Name-Last: Burgette Author-Name: Jerome P. Reiter Author-X-Name-First: Jerome P. Author-X-Name-Last: Reiter Title: Nonparametric Bayesian Multiple Imputation for Missing Data Due to Mid-Study Switching of Measurement Methods Abstract: Investigators often change how variables are measured during the middle of data collection, for example, in hopes of obtaining greater accuracy or reducing costs. The resulting data comprise sets of observations measured on two (or more) different scales, which complicates interpretation and can create bias in analyses that rely directly on the differentially measured variables. We develop approaches based on multiple imputation for handling mid-study changes in measurement for settings without calibration data, that is, no subjects are measured on both (all) scales. This setting creates a seemingly insurmountable problem for multiple imputation: since the measurements never appear jointly, there is no information in the data about their association. We resolve the problem by making an often scientifically reasonable assumption that each measurement regime accurately ranks the samples but on differing scales, so that, for example, an individual at the qth percentile on one scale should be at about the qth percentile on the other scale. We use rank-preservation assumptions to develop three imputation strategies that flexibly transform measurements made in one scale to measurements made in another: a Markov chain Monte Carlo (MCMC)-free approach based on permuting ranks of measurements, and two approaches based on dependent Dirichlet process (DDP) mixture models for imputing values conditional on covariates. We use simulations to illustrate conditions under which each strategy performs well, and present guidance on when to apply each. We apply these methods to a study of birth outcomes in which investigators collected mothers’ blood samples to measure levels of environmental contaminants. Midway through data ascertainment, the study switched from one analytical lab to another. The distributions of blood lead levels differ greatly across the two labs, suggesting that the labs report measurements according to different scales. We use nonparametric Bayesian imputation models to obtain sets of plausible measurements on a common scale, and estimate quantile regressions of birth weight on various environmental contaminants. 
Journal: Journal of the American Statistical Association Pages: 439-449 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2011.643713 File-URL: http://hdl.handle.net/10.1080/01621459.2011.643713 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:439-449 Template-Type: ReDIF-Article 1.0 Author-Name: Paolo Frumento Author-X-Name-First: Paolo Author-X-Name-Last: Frumento Author-Name: Fabrizia Mealli Author-X-Name-First: Fabrizia Author-X-Name-Last: Mealli Author-Name: Barbara Pacini Author-X-Name-First: Barbara Author-X-Name-Last: Pacini Author-Name: Donald B. Rubin Author-X-Name-First: Donald B. Author-X-Name-Last: Rubin Title: Evaluating the Effect of Training on Wages in the Presence of Noncompliance, Nonemployment, and Missing Outcome Data Abstract: The effects of a job training program, Job Corps, on both employment and wages are evaluated using data from a randomized study. Principal stratification is used to address, simultaneously, the complications of noncompliance, wages that are only partially defined because of nonemployment, and unintended missing outcomes. The first two complications are of substantive interest, whereas the third is a nuisance. The objective is to find a parsimonious model that can be used to inform public policy. We conduct a likelihood-based analysis using finite mixture models estimated by the expectation-maximization (EM) algorithm. We maintain an exclusion restriction assumption for the effect of assignment on employment and wages for noncompliers, but not on missingness. We provide estimates under the “missing at random” assumption, and assess the robustness of our results to deviations from it. The plausibility of meaningful restrictions is investigated by means of scaled log-likelihood ratio statistics. Substantive conclusions include the following. For compliers, the effect on employment is negative in the short term; it becomes positive in the long term, but these effects are small at best. For always employed compliers, that is, compliers who are employed whether trained or not trained, positive effects on wages are found at all time periods. Our analysis reveals that background characteristics of individuals differ markedly across the principal strata. We found evidence that the program should have been better targeted, in the sense of being designed differently for different groups of people, and specific suggestions are offered. Previous analyses of this dataset, which did not address all complications in a principled manner, led to less nuanced conclusions about Job Corps. Journal: Journal of the American Statistical Association Pages: 450-466 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2011.643719 File-URL: http://hdl.handle.net/10.1080/01621459.2011.643719 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:450-466 Template-Type: ReDIF-Article 1.0 Author-Name: Earvin Balderama Author-X-Name-First: Earvin Author-X-Name-Last: Balderama Author-Name: Frederic Paik Schoenberg Author-X-Name-First: Frederic Paik Author-X-Name-Last: Schoenberg Author-Name: Erin Murray Author-X-Name-First: Erin Author-X-Name-Last: Murray Author-Name: Philip W. Rundel Author-X-Name-First: Philip W. 
Author-X-Name-Last: Rundel Title: Application of Branching Models in the Study of Invasive Species Abstract: Earthquake occurrences are often described using a class of branching models called epidemic-type aftershock sequence (ETAS) models. The name derives from the fact that the model allows earthquakes to cause aftershocks, and then those aftershocks may induce subsequent aftershocks, and so on. Despite their value in seismology, such models have not previously been used in studying the incidence of invasive plant and animal species. Here, we apply ETAS models to study the spread of an invasive species in Costa Rica (Musa velutina, or red banana). One challenge in this ecological application is that fitting the model requires the originations of the plants, which are not observed but may be estimated using field data on the heights of the plants on a given date and their empirical growth rates. We then characterize the estimated spatial-temporal rate of spread of red banana plants using a space-time ETAS model. Journal: Journal of the American Statistical Association Pages: 467-476 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2011.641402 File-URL: http://hdl.handle.net/10.1080/01621459.2011.641402 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:467-476 Template-Type: ReDIF-Article 1.0 Author-Name: Giseon Heo Author-X-Name-First: Giseon Author-X-Name-Last: Heo Author-Name: Jennifer Gamble Author-X-Name-First: Jennifer Author-X-Name-Last: Gamble Author-Name: Peter T. Kim Author-X-Name-First: Peter T. Author-X-Name-Last: Kim Title: Topological Analysis of Variance and the Maxillary Complex Abstract: It is common to reduce the dimensionality of data before applying classical multivariate analysis techniques in statistics. Persistent homology, a recent development in computational topology, has been shown to be useful for analyzing high-dimensional (nonlinear) data. In this article, we connect computational topology with the traditional analysis of variance and demonstrate the value of combining these approaches on a three-dimensional orthodontic landmark dataset derived from the maxillary complex. Indeed, combining appropriate techniques of both persistent homology and analysis of variance results in a better understanding of the data’s nonlinear features over and above what could have been achieved by classical means. Supplementary material for this article is available online. Journal: Journal of the American Statistical Association Pages: 477-492 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2011.641430 File-URL: http://hdl.handle.net/10.1080/01621459.2011.641430 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:477-492 Template-Type: ReDIF-Article 1.0 Author-Name: Lu Wang Author-X-Name-First: Lu Author-X-Name-Last: Wang Author-Name: Andrea Rotnitzky Author-X-Name-First: Andrea Author-X-Name-Last: Rotnitzky Author-Name: Xihong Lin Author-X-Name-First: Xihong Author-X-Name-Last: Lin Author-Name: Randall E. Millikan Author-X-Name-First: Randall E. Author-X-Name-Last: Millikan Author-Name: Peter F. Thall Author-X-Name-First: Peter F. 
Author-X-Name-Last: Thall Title: Evaluation of Viable Dynamic Treatment Regimes in a Sequentially Randomized Trial of Advanced Prostate Cancer Abstract: We present new statistical analyses of data arising from a clinical trial designed to compare two-stage dynamic treatment regimes (DTRs) for advanced prostate cancer. The trial protocol mandated that patients be initially randomized among four chemotherapies, and that those who responded poorly be re-randomized to one of the remaining candidate therapies. The primary aim was to compare the DTRs’ overall success rates, with success defined by the occurrence of successful responses in each of two consecutive courses of the patient’s therapy. Of the 150 study participants, 47 did not complete their therapy as per the algorithm. However, 35 of them did so for reasons that precluded further chemotherapy, that is, toxicity and/or progressive disease. Consequently, rather than comparing the overall success rates of the DTRs in the unrealistic event that these patients had remained on their assigned chemotherapies, we conducted an analysis that compared viable switch rules defined by the per-protocol rules but with the additional provision that patients who developed toxicity or progressive disease switch to a non-prespecified therapeutic or palliative strategy. This modification involved consideration of bivariate per-course outcomes encoding both efficacy and toxicity. We used numerical scores elicited from the trial’s principal investigator to quantify the clinical desirability of each bivariate per-course outcome, and defined one endpoint as their average over all courses of treatment. Two other simpler sets of scores as well as log survival time were also used as endpoints. Estimation of each DTR-specific mean score was conducted using inverse probability weighted methods that assumed that missingness in the 12 remaining dropouts was informative but explainable in that it only depended on past recorded data. We conducted additional worst- and best-case analyses to evaluate sensitivity of our findings to extreme departures from the explainable dropout assumption. Journal: Journal of the American Statistical Association Pages: 493-508 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2011.641416 File-URL: http://hdl.handle.net/10.1080/01621459.2011.641416 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:493-508 Template-Type: ReDIF-Article 1.0 Author-Name: Daniel Almirall Author-X-Name-First: Daniel Author-X-Name-Last: Almirall Author-Name: Daniel J. Lizotte Author-X-Name-First: Daniel J. Author-X-Name-Last: Lizotte Author-Name: Susan A. Murphy Author-X-Name-First: Susan A. Author-X-Name-Last: Murphy Title: Comment Journal: Journal of the American Statistical Association Pages: 509-512 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.665615 File-URL: http://hdl.handle.net/10.1080/01621459.2012.665615 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:509-512 Template-Type: ReDIF-Article 1.0 Author-Name: Paul Chaffee Author-X-Name-First: Paul Author-X-Name-Last: Chaffee Author-Name: Mark van der Laan Author-X-Name-First: Mark Author-X-Name-Last: van der Laan Title: Comment Journal: Journal of the American Statistical Association Pages: 513-517 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.665197 File-URL: http://hdl.handle.net/10.1080/01621459.2012.665197 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:513-517 Template-Type: ReDIF-Article 1.0 Author-Name: Lu Wang Author-X-Name-First: Lu Author-X-Name-Last: Wang Author-Name: Andrea Rotnitzky Author-X-Name-First: Andrea Author-X-Name-Last: Rotnitzky Author-Name: Xihong Lin Author-X-Name-First: Xihong Author-X-Name-Last: Lin Author-Name: Randall E. Millikan Author-X-Name-First: Randall E. Author-X-Name-Last: Millikan Author-Name: Peter F. Thall Author-X-Name-First: Peter F. Author-X-Name-Last: Thall Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 518-520 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.665198 File-URL: http://hdl.handle.net/10.1080/01621459.2012.665198 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:518-520 Template-Type: ReDIF-Article 1.0 Author-Name: Michael E. Sobel Author-X-Name-First: Michael E. Author-X-Name-Last: Sobel Title: Does Marriage Boost Men’s Wages?: Identification of Treatment Effects in Fixed Effects Regression Models for Panel Data Abstract: Social scientists have generated a large and inconclusive literature on the effect(s) of marriage on men’s wages. Researchers have hypothesized that the wage premium enjoyed by married men may reflect both a tendency for more productive men to marry and an effect of marriage on productivity. To sort out these explanations, researchers have used fixed effects regression models for panel data to adjust for selection on unobserved time-invariant confounders, interpreting coefficients for the time-varying marriage variables as effects. However, they did not define these effects or give conditions under which the regression coefficients would warrant a causal interpretation. Consequently, they failed to appropriately adjust for important time-varying confounders and misinterpreted their results. Regression models for panel data with unobserved time-invariant confounders are also widely used in many other policy-relevant contexts and the same problems arise there. This article draws on recent statistical work on causal inference with longitudinal data to clarify these problems and help researchers use appropriate methods to model their data. A basic set of treatment effects is defined and used to define derived effects. Causal models for panel data with unobserved time-invariant confounders are defined and the treatment effects are reexpressed in terms of these models. Ignorability conditions under which the parameters of the causal models are identified from the regression models are given. Even when these hold, a number of interesting and important treatment effects are typically not identified. 
Journal: Journal of the American Statistical Association Pages: 521-529 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2011.646917 File-URL: http://hdl.handle.net/10.1080/01621459.2011.646917 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:521-529 Template-Type: ReDIF-Article 1.0 Author-Name: Xi Luo Author-X-Name-First: Xi Author-X-Name-Last: Luo Author-Name: Dylan S. Small Author-X-Name-First: Dylan S. Author-X-Name-Last: Small Author-Name: Chiang-Shan R. Li Author-X-Name-First: Chiang-Shan R. Author-X-Name-Last: Li Author-Name: Paul R. Rosenbaum Author-X-Name-First: Paul R. Author-X-Name-Last: Rosenbaum Title: Inference With Interference Between Units in an fMRI Experiment of Motor Inhibition Abstract: An experimental unit is an opportunity to randomly apply or withhold a treatment. There is interference between units if the application of the treatment to one unit may also affect other units. In cognitive neuroscience, a common form of experiment presents a sequence of stimuli or requests for cognitive activity at random to each experimental subject and measures biological aspects of brain activity that follow these requests. Each subject is then many experimental units, and interference between units within an experimental subject is likely, in part because the stimuli follow one another quickly and in part because human subjects learn or become experienced or primed or bored as the experiment proceeds. We use a recent functional magnetic resonance imaging (fMRI) experiment concerned with the inhibition of motor activity to illustrate and further develop recently proposed methodology for inference in the presence of interference. A simulation evaluates the power of competing procedures. Journal: Journal of the American Statistical Association Pages: 530-541 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.655954 File-URL: http://hdl.handle.net/10.1080/01621459.2012.655954 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:530-541 Template-Type: ReDIF-Article 1.0 Author-Name: Li Li Author-X-Name-First: Li Author-X-Name-Last: Li Author-Name: Joseph J. Eron Author-X-Name-First: Joseph J. Author-X-Name-Last: Eron Author-Name: Heather Ribaudo Author-X-Name-First: Heather Author-X-Name-Last: Ribaudo Author-Name: Roy M. Gulick Author-X-Name-First: Roy M. Author-X-Name-Last: Gulick Author-Name: Brent A. Johnson Author-X-Name-First: Brent A. Author-X-Name-Last: Johnson Title: Evaluating the Effect of Early Versus Late ARV Regimen Change if Failure on an Initial Regimen: Results From the AIDS Clinical Trials Group Study A5095 Abstract: The current goal of initial antiretroviral (ARV) therapy is suppression of plasma human immunodeficiency virus (HIV)-1 RNA levels to below 200 copies per milliliter. A proportion of HIV-infected patients who initiate antiretroviral therapy in clinical practice or antiretroviral clinical trials either fail to suppress HIV-1 RNA or have HIV-1 RNA levels rebound on therapy. Frequently, these patients have sustained CD4 cell count responses and limited or no clinical symptoms and, therefore, have potentially limited indications for altering therapy, which they may be tolerating well despite increased viral replication. 
On the other hand, increased viral replication on therapy leads to selection of resistance mutations to the antiretroviral agents comprising their therapy and potentially cross-resistance to other agents in the same class, decreasing the likelihood of response to subsequent antiretroviral therapy. The optimal time to switch antiretroviral therapy to ensure sustained virologic suppression and prevent clinical events in patients who have rebound in their HIV-1 RNA, yet are stable, is not known. Randomized clinical trials to compare early versus delayed switching have been difficult to design and more difficult to enroll. In some clinical trials, such as the AIDS Clinical Trials Group (ACTG) Study A5095, patients randomized to initial antiretroviral treatment combinations who fail to suppress HIV-1 RNA, or have a rebound of HIV-1 RNA on therapy, are allowed to switch from the initial ARV regimen to a new regimen, based on clinician and patient decisions. We delineate a statistical framework to estimate the effect of early versus late regimen change using data from ACTG A5095 in the context of two-stage designs. In causal inference, a large class of doubly robust estimators is derived through semiparametric theory with applications to missing data problems. This class of estimators is motivated through geometric arguments and relies on large samples for good performance. By now, several authors have noted that a doubly robust estimator may be suboptimal when the outcome model is misspecified even if it is semiparametric efficient when the outcome regression model is correctly specified. Using auxiliary variables and two-stage designs, and within the contextual backdrop of our scientific problem and clinical study, we propose improved doubly robust, locally efficient estimators of a population mean and average causal effect for early versus delayed switching to second-line ARV treatment regimens. Our analysis of the ACTG A5095 data further demonstrates how methods that use auxiliary variables can improve over methods that ignore them. Using the methods developed here, we conclude that patients who switch within 8 weeks of virologic failure have better clinical outcomes, on average, than patients who delay switching to a new second-line ARV regimen after failing on the initial regimen. Ordinary statistical methods fail to find such differences. This article has online supplementary material. Journal: Journal of the American Statistical Association Pages: 542-554 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2011.646932 File-URL: http://hdl.handle.net/10.1080/01621459.2011.646932 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:542-554 Template-Type: ReDIF-Article 1.0 Author-Name: Dulal K. Bhaumik Author-X-Name-First: Dulal K. Author-X-Name-Last: Bhaumik Author-Name: Anup Amatya Author-X-Name-First: Anup Author-X-Name-Last: Amatya Author-Name: Sharon-Lise T. Normand Author-X-Name-First: Sharon-Lise T. Author-X-Name-Last: Normand Author-Name: Joel Greenhouse Author-X-Name-First: Joel Author-X-Name-Last: Greenhouse Author-Name: Eloise Kaizar Author-X-Name-First: Eloise Author-X-Name-Last: Kaizar Author-Name: Brian Neelon Author-X-Name-First: Brian Author-X-Name-Last: Neelon Author-Name: Robert D. Gibbons Author-X-Name-First: Robert D. 
Author-X-Name-Last: Gibbons Title: Meta-Analysis of Rare Binary Adverse Event Data Abstract: We examine the use of fixed-effects and random-effects moment-based meta-analytic methods for analysis of binary adverse-event data. Special attention is paid to the case of rare adverse events that are commonly encountered in routine practice. We study estimation of model parameters and between-study heterogeneity. In addition, we examine traditional approaches to hypothesis testing of the average treatment effect and detection of the heterogeneity of treatment effect across studies. We derive three new methods, a simple (unweighted) average treatment effect estimator, a new heterogeneity estimator, and a parametric bootstrapping test for heterogeneity. We then study the statistical properties of both the traditional and the new methods via simulation. We find that in general, moment-based estimators of combined treatment effects and heterogeneity are biased and the degree of bias is proportional to the rarity of the event under study. The new methods eliminate much, but not all, of this bias. The various estimators and hypothesis testing methods are then compared and contrasted using an example dataset on treatment of stable coronary artery disease. Journal: Journal of the American Statistical Association Pages: 555-567 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.664484 File-URL: http://hdl.handle.net/10.1080/01621459.2012.664484 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:555-567 Template-Type: ReDIF-Article 1.0 Author-Name: Hakmook Kang Author-X-Name-First: Hakmook Author-X-Name-Last: Kang Author-Name: Hernando Ombao Author-X-Name-First: Hernando Author-X-Name-Last: Ombao Author-Name: Crystal Linkletter Author-X-Name-First: Crystal Author-X-Name-Last: Linkletter Author-Name: Nicole Long Author-X-Name-First: Nicole Author-X-Name-Last: Long Author-Name: David Badre Author-X-Name-First: David Author-X-Name-Last: Badre Title: Spatio-Spectral Mixed-Effects Model for Functional Magnetic Resonance Imaging Data Abstract: The goal of this article is to model cognitive control related activation among predefined regions of interest (ROIs) of the human brain while properly adjusting for the underlying spatio-temporal correlations. Standard approaches to fMRI analysis do not simultaneously take into account both the spatial and temporal correlations that are prevalent in fMRI data. This is primarily due to the computational complexity of estimating the spatio-temporal covariance matrix. More specifically, they do not take into account multiscale spatial correlation (between-ROIs and within-ROI). To address these limitations, we propose a spatio-spectral mixed-effects model. Working in the spectral domain simplifies the temporal covariance structure because the Fourier coefficients are approximately uncorrelated across frequencies. Additionally, by incorporating voxel-specific and ROI-specific random effects, the model is able to capture the multiscale spatial covariance structure: distance-dependent local correlation (within an ROI), and distance-independent global correlation (between-ROIs). Building on existing theory on linear mixed-effects models to conduct estimation and inference, we applied our model to fMRI data to study activation in prespecified ROIs in the prefrontal cortex and estimate the correlation structure in the network. 
Simulation studies demonstrate that ignoring the multiscale correlation leads to higher false positive error rates. Journal: Journal of the American Statistical Association Pages: 568-577 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.664503 File-URL: http://hdl.handle.net/10.1080/01621459.2012.664503 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:568-577 Template-Type: ReDIF-Article 1.0 Author-Name: Thomas Barrios Author-X-Name-First: Thomas Author-X-Name-Last: Barrios Author-Name: Rebecca Diamond Author-X-Name-First: Rebecca Author-X-Name-Last: Diamond Author-Name: Guido W. Imbens Author-X-Name-First: Guido W. Author-X-Name-Last: Imbens Author-Name: Michal Kolesár Author-X-Name-First: Michal Author-X-Name-Last: Kolesár Title: Clustering, Spatial Correlations, and Randomization Inference Abstract: It is standard practice in regression analyses to allow for clustering in the error covariance matrix if the explanatory variable of interest varies at a more aggregate level (e.g., the state level) than the units of observation (e.g., individuals). Often, however, the structure of the error covariance matrix is more complex, with correlations not vanishing for units in different clusters. Here, we explore the implications of such correlations for the actual and estimated precision of least squares estimators. Our main theoretical result is that with equal-sized clusters, if the covariate of interest is randomly assigned at the cluster level, only accounting for nonzero covariances at the cluster level, and ignoring correlations between clusters as well as differences in within-cluster correlations, leads to valid confidence intervals. However, in the absence of random assignment of the covariates, ignoring general correlation structures may lead to biases in standard errors. We illustrate our findings using the 5% public-use census data. Based on these results, we recommend that researchers, as a matter of routine, explore the extent of spatial correlations in explanatory variables beyond state-level clustering. Journal: Journal of the American Statistical Association Pages: 578-591 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.682524 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682524 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:578-591 Template-Type: ReDIF-Article 1.0 Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Jingjin Zhang Author-X-Name-First: Jingjin Author-X-Name-Last: Zhang Author-Name: Ke Yu Author-X-Name-First: Ke Author-X-Name-Last: Yu Title: Vast Portfolio Selection With Gross-Exposure Constraints Abstract: This article introduces large portfolio selection using gross-exposure constraints. It shows that with gross-exposure constraints, the empirically selected optimal portfolios based on estimated covariance matrices have performance similar to that of the theoretically optimal ones, and there is no error accumulation effect from estimation of vast covariance matrices. This gives theoretical justification to the empirical results of Jagannathan and Ma. It also shows that the no-short-sale portfolio can be improved by allowing some short positions. The applications to portfolio selection, tracking, and improvements are also addressed. 
The utility of our new approach is illustrated by simulation and empirical studies on the 100 Fama--French industrial portfolios and the 600 stocks randomly selected from Russell 3000. Journal: Journal of the American Statistical Association Pages: 592-606 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.682825 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682825 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:592-606 Template-Type: ReDIF-Article 1.0 Author-Name: Alessandra Luati Author-X-Name-First: Alessandra Author-X-Name-Last: Luati Author-Name: Tommaso Proietti Author-X-Name-First: Tommaso Author-X-Name-Last: Proietti Author-Name: Marco Reale Author-X-Name-First: Marco Author-X-Name-Last: Reale Title: The Variance Profile Abstract: The variance profile is defined as the power mean of the spectral density function of a stationary stochastic process. It is a continuous and nondecreasing function of the power parameter, p, which returns the minimum of the spectrum (p→−∞), the interpolation error variance (harmonic mean, p=−1), the prediction error variance (geometric mean, p=0), the unconditional variance (arithmetic mean, p=1), and the maximum of the spectrum (p→∞). The variance profile provides a useful characterization of a stochastic process; we focus in particular on the class of fractionally integrated processes. Moreover, it enables a direct and immediate derivation of the Szegö-Kolmogorov formula and the interpolation error variance formula. The article proposes a nonparametric estimator of the variance profile based on the power mean of the smoothed sample spectrum, and proves its consistency and its asymptotic normality. From the empirical standpoint, we propose and illustrate the use of the variance profile for estimating the long memory parameter in climatological and financial time series and for assessing structural change. Journal: Journal of the American Statistical Association Pages: 607-621 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.682832 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682832 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:607-621 Template-Type: ReDIF-Article 1.0 Author-Name: Viktor Todorov Author-X-Name-First: Viktor Author-X-Name-Last: Todorov Author-Name: George Tauchen Author-X-Name-First: George Author-X-Name-Last: Tauchen Title: Inverse Realized Laplace Transforms for Nonparametric Volatility Density Estimation in Jump-Diffusions Abstract: This article develops a nonparametric estimator of the stochastic volatility density of a discretely observed Itô semimartingale in the setting of an increasing time span and finer mesh of the observation grid. There are two basic steps involved. The first step is aggregating the high-frequency increments into the realized Laplace transform, which is a robust nonparametric estimate of the underlying volatility Laplace transform. The second step is using a regularized kernel to invert the realized Laplace transform. These two steps are relatively quick and easy to compute, so the nonparametric estimator is practicable. The article also derives bounds for the mean squared error of the estimator. 
The regularity conditions are sufficiently general to cover empirically important cases such as level jumps and possible dependencies between volatility moves and either diffusive or jump moves in the semimartingale. The Monte Carlo analysis in this study indicates that the nonparametric estimator is reliable and reasonably accurate in realistic estimation contexts. An empirical application to 5-min data for three large-cap stocks, 1997--2010, reveals the importance of big short-term volatility spikes in generating high levels of stock price variability over and above those induced by price jumps. The application also shows how to trace out the dynamic response of the volatility density to both positive and negative jumps in the stock price. Journal: Journal of the American Statistical Association Pages: 622-635 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.682854 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682854 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:622-635 Template-Type: ReDIF-Article 1.0 Author-Name: James O. Berger Author-X-Name-First: James O. Author-X-Name-Last: Berger Author-Name: Jose M. Bernardo Author-X-Name-First: Jose M. Author-X-Name-Last: Bernardo Author-Name: Dongchu Sun Author-X-Name-First: Dongchu Author-X-Name-Last: Sun Title: Objective Priors for Discrete Parameter Spaces Abstract: This article considers the development of objective prior distributions for discrete parameter spaces. Formal approaches to such development—such as the reference prior approach—often result in a constant prior for a discrete parameter, which is questionable for problems that exhibit certain types of structure. To take advantage of structure, this article proposes embedding the original problem in a continuous problem that preserves the structure, and then using standard reference prior theory to determine the appropriate objective prior. Four different possibilities for this embedding are explored, and applied to a population-size model, the hypergeometric distribution, the multivariate hypergeometric distribution, the binomial-beta distribution, and the binomial distribution. The recommended objective priors for the first, third, and fourth problems are new. Journal: Journal of the American Statistical Association Pages: 636-648 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.682538 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682538 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:636-648 Template-Type: ReDIF-Article 1.0 Author-Name: Valen E. Johnson Author-X-Name-First: Valen E. Author-X-Name-Last: Johnson Author-Name: David Rossell Author-X-Name-First: David Author-X-Name-Last: Rossell Title: Bayesian Model Selection in High-Dimensional Settings Abstract: Standard assumptions incorporated into Bayesian model selection procedures result in procedures that are not competitive with commonly used penalized likelihood methods. We propose modifications of these methods by imposing nonlocal prior densities on model parameters. We show that the resulting model selection procedures are consistent in linear model settings when the number of possible covariates p is bounded by the number of observations n, a property that has not been extended to other model selection procedures. 
In addition to consistently identifying the true model, the proposed procedures provide accurate estimates of the posterior probability that each identified model is correct. Through simulation studies, we demonstrate that these model selection procedures perform as well or better than commonly used penalized likelihood methods in a range of simulation settings. Proofs of the primary theorems are provided in the Supplementary Material that is available online. Journal: Journal of the American Statistical Association Pages: 649-660 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.682536 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682536 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:649-660 Template-Type: ReDIF-Article 1.0 Author-Name: Peter Hall Author-X-Name-First: Peter Author-X-Name-Last: Hall Author-Name: Michael G. Schimek Author-X-Name-First: Michael G. Author-X-Name-Last: Schimek Title: Moderate-Deviation-Based Inference for Random Degeneration in Paired Rank Lists Abstract: Consider a problem where N items (objects or individuals) are judged by assessors using their perceptions of a set of performance criteria, or alternatively by technical devices. In particular, two assessors might rank the items between 1 and N on the basis of relative performance, independently of each other. We can aggregate the rank lists by assigning one if the two assessors agree, and zero otherwise, and we can modify this approach to make it robust against irregularities. In this article, we consider methods and algorithms that can be used to address this problem. We study their theoretical properties in the case of a model based on nonstationary Bernoulli trials, and we report on their numerical properties for both simulated and real data. Journal: Journal of the American Statistical Association Pages: 661-672 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.682539 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682539 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:661-672 Template-Type: ReDIF-Article 1.0 Author-Name: Wenguang Sun Author-X-Name-First: Wenguang Author-X-Name-Last: Sun Author-Name: Alexander C. McLain Author-X-Name-First: Alexander C. Author-X-Name-Last: McLain Title: Multiple Testing of Composite Null Hypotheses in Heteroscedastic Models Abstract: In large-scale studies, the true effect sizes often range continuously from zero to small to large, and are observed with heteroscedastic errors. In practical situations where the failure to reject small deviations from the null is inconsequential, specifying an indifference region (or forming composite null hypotheses) can greatly reduce the number of unimportant discoveries in multiple testing. The heteroscedasticity issue poses new challenges for multiple testing with composite nulls. In particular, the conventional framework in multiple testing, which involves rescaling or standardization, is likely to distort the scientific question. We propose the concept of a composite null distribution for heteroscedastic models and develop an optimal testing procedure that minimizes the false nondiscovery rate, subject to a constraint on the false discovery rate. 
The proposed approach is different from conventional methods in that the effect size, statistical significance, and multiplicity issues are addressed integrally. The external information of heteroscedastic errors is incorporated for optimal simultaneous inference. The new features and advantages of our approach are demonstrated using both simulated and real data. The numerical studies demonstrate that our new procedure enjoys superior performance with greater accuracy and better interpretability of results. Journal: Journal of the American Statistical Association Pages: 673-687 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.664505 File-URL: http://hdl.handle.net/10.1080/01621459.2012.664505 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:673-687 Template-Type: ReDIF-Article 1.0 Author-Name: Liuquan Sun Author-X-Name-First: Liuquan Author-X-Name-Last: Sun Author-Name: Xinyuan Song Author-X-Name-First: Xinyuan Author-X-Name-Last: Song Author-Name: Jie Zhou Author-X-Name-First: Jie Author-X-Name-Last: Zhou Author-Name: Lei Liu Author-X-Name-First: Lei Author-X-Name-Last: Liu Title: Joint Analysis of Longitudinal Data With Informative Observation Times and a Dependent Terminal Event Abstract: In many longitudinal studies, repeated measures are often correlated with observation times. Also, there may exist a dependent terminal event such as death that stops the follow-up. In this article, we propose a new joint model for the analysis of longitudinal data in the presence of both informative observation times and a dependent terminal event via latent variables. Estimating equation approaches are developed for parameter estimation, and the resulting estimators are shown to be consistent and asymptotically normal. In addition, some graphical and numerical procedures are presented for model checking. Simulation studies demonstrate that the proposed method performs well for practical settings. An application to a medical cost study of chronic heart failure patients from the University of Virginia Health System is provided. Journal: Journal of the American Statistical Association Pages: 688-700 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.682528 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682528 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:688-700 Template-Type: ReDIF-Article 1.0 Author-Name: Jianhui Zhou Author-X-Name-First: Jianhui Author-X-Name-Last: Zhou Author-Name: Annie Qu Author-X-Name-First: Annie Author-X-Name-Last: Qu Title: Informative Estimation and Selection of Correlation Structure for Longitudinal Data Abstract: Identifying an informative correlation structure is important in improving estimation efficiency for longitudinal data. We approximate the empirical estimator of the correlation matrix by groups of known basis matrices that represent different correlation structures, and transform the correlation structure selection problem to a covariate selection problem. To address both the complexity and the informativeness of the correlation matrix, we minimize an objective function that consists of two parts: the difference between the empirical information and a model approximation of the correlation matrix, and a penalty that penalizes models with too many basis matrices. 
The unique feature of the proposed estimation and selection of correlation structure is that it does not require the specification of the likelihood function, and therefore it is applicable for discrete longitudinal data. We carry out the proposed method through a groupwise penalty strategy, which is able to identify more complex structures. The proposed method possesses the oracle property and selects the true correlation structure consistently. In addition, the estimator of the correlation parameters follows a normal distribution asymptotically. Simulation studies and a data example confirm that the proposed method works effectively in estimating and selecting the true structure in finite samples, and it enables improvement in estimation efficiency by selecting the true structures. Journal: Journal of the American Statistical Association Pages: 701-710 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.682534 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682534 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:701-710 Template-Type: ReDIF-Article 1.0 Author-Name: Mian Huang Author-X-Name-First: Mian Author-X-Name-Last: Huang Author-Name: Weixin Yao Author-X-Name-First: Weixin Author-X-Name-Last: Yao Title: Mixture of Regression Models With Varying Mixing Proportions: A Semiparametric Approach Abstract: In this article, we study a class of semiparametric mixtures of regression models, in which the regression functions are linear functions of the predictors, but the mixing proportions are smooth functions of a covariate. We propose a one-step backfitting estimation procedure to achieve the optimal convergence rates for both regression parameters and the nonparametric functions of mixing proportions. We derive the asymptotic bias and variance of the one-step estimate, and further establish its asymptotic normality. A modified expectation-maximization-type (EM-type) estimation procedure is investigated. We show that the modified EM algorithms preserve the asymptotic ascent property. Numerical simulations are conducted to examine the finite sample performance of the estimation procedures. The proposed methodology is further illustrated via an analysis of a real dataset. Journal: Journal of the American Statistical Association Pages: 711-724 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.682541 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682541 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:711-724 Template-Type: ReDIF-Article 1.0 Author-Name: Peng Wang Author-X-Name-First: Peng Author-X-Name-Last: Wang Author-Name: Guei-feng Tsai Author-X-Name-First: Guei-feng Author-X-Name-Last: Tsai Author-Name: Annie Qu Author-X-Name-First: Annie Author-X-Name-Last: Qu Title: Conditional Inference Functions for Mixed-Effects Models With Unspecified Random-Effects Distribution Abstract: In longitudinal studies, mixed-effects models are important for addressing subject-specific effects. However, most existing approaches assume a normal distribution for the random effects, and this could affect the bias and efficiency of the fixed-effects estimator. Even in cases where the estimation of the fixed effects is robust with a misspecified distribution of the random effects, the estimation of the random effects could be invalid. 
We propose a new approach to estimate fixed and random effects using conditional quadratic inference functions (QIFs). The new approach does not require the specification of likelihood functions or a normality assumption for random effects. It can also accommodate serial correlation between observations within the same cluster, in addition to mixed-effects modeling. Other advantages include not requiring the estimation of the unknown variance components associated with the random effects, or the nuisance parameters associated with the working correlations. We establish asymptotic results for the fixed-effect parameter estimators that do not rely on the consistency of the random-effect estimators. Real data examples and simulations are used to compare the new approach with the penalized quasi-likelihood (PQL) approach, and SAS GLIMMIX and nonlinear mixed-effects model (NLMIXED) procedures. Supplemental materials including technical details are available online. Journal: Journal of the American Statistical Association Pages: 725-736 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.665199 File-URL: http://hdl.handle.net/10.1080/01621459.2012.665199 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:725-736 Template-Type: ReDIF-Article 1.0 Author-Name: Jun Li Author-X-Name-First: Jun Author-X-Name-Last: Li Author-Name: Juan A. Cuesta-Albertos Author-X-Name-First: Juan A. Author-X-Name-Last: Cuesta-Albertos Author-Name: Regina Y. Liu Author-X-Name-First: Regina Y. Author-X-Name-Last: Liu Title: DD-Classifier: Nonparametric Classification Procedure Based on DD-Plot Abstract: Using the DD-plot (depth vs. depth plot), we introduce a new nonparametric classification algorithm and call it DD-classifier. The algorithm is completely nonparametric, and it requires no prior knowledge of the underlying distributions or the form of the separating curve. Thus, it can be applied to a wide range of classification problems. The algorithm is completely data driven and its classification outcome can be easily visualized in a two-dimensional plot regardless of the dimension of the data. Moreover, it has the advantage of bypassing the estimation of underlying parameters such as means and scales, which is often required by the existing classification procedures. We study the asymptotic properties of the DD-classifier and its misclassification rate. Specifically, we show that DD-classifier is asymptotically equivalent to the Bayes rule under suitable conditions, and it can achieve Bayes error for a family broader than elliptical distributions. The performance of the classifier is also examined using simulated and real datasets. Overall, the DD-classifier performs well across a broad range of settings, and compares favorably with existing classifiers. It can also be robust against outliers or contamination. Journal: Journal of the American Statistical Association Pages: 737-753 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.688462 File-URL: http://hdl.handle.net/10.1080/01621459.2012.688462 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:737-753 Template-Type: ReDIF-Article 1.0 Author-Name: Ute Hahn Author-X-Name-First: Ute Author-X-Name-Last: Hahn Title: A Studentized Permutation Test for the Comparison of Spatial Point Patterns Abstract: In this study, a new test is proposed for the hypothesis that two (or more) observed point patterns are realizations of the same spatial point process model. To this end, the point patterns are divided into disjoint quadrats, on each of which an estimate of Ripley’s K-function is calculated. The two groups of empirical K-functions are compared by a permutation test using a Studentized test statistic. The proposed test performs convincingly in terms of empirical level and power in a simulation study, even for point patterns where the K-function estimates on neighboring subsamples are not strictly exchangeable. It also shows improved behavior compared with a test suggested by Diggle et al. for the comparison of groups of independently replicated point patterns. In an application to two point patterns from pathology that represent capillary positions in sections of healthy and cancerous tissue, our Studentized permutation test indicates statistical significance, although the patterns cannot be clearly distinguished by the eye. Journal: Journal of the American Statistical Association Pages: 754-764 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.688463 File-URL: http://hdl.handle.net/10.1080/01621459.2012.688463 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:754-764 Template-Type: ReDIF-Article 1.0 Author-Name: Ta-Hsin Li Author-X-Name-First: Ta-Hsin Author-X-Name-Last: Li Title: Quantile Periodograms Abstract: Two periodogram-like functions, called quantile periodograms, are introduced for spectral analysis of time series. The quantile periodograms are constructed from trigonometric quantile regression and motivated by different interpretations of the ordinary periodogram. Analytical and numerical results demonstrate the capability of the quantile periodograms for detecting hidden periodicity in the quantiles and for providing an additional view of time-series data. A connection between the quantile periodograms and the so-called level-crossing spectrum is established through an asymptotic analysis. Journal: Journal of the American Statistical Association Pages: 765-776 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.682815 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682815 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:765-776 Template-Type: ReDIF-Article 1.0 Author-Name: Thomas J. Fisher Author-X-Name-First: Thomas J. Author-X-Name-Last: Fisher Author-Name: Colin M. Gallagher Author-X-Name-First: Colin M. Author-X-Name-Last: Gallagher Title: New Weighted Portmanteau Statistics for Time Series Goodness of Fit Testing Abstract: We exploit ideas from high-dimensional data analysis to derive new portmanteau tests that are based on the trace of the square of the mth order autocorrelation matrix. The resulting statistics are weighted sums of the squares of the sample autocorrelation coefficients that, unlike many other tests appearing in the literature, are numerically stable even when the number of lags considered is relatively close to the sample size. 
The statistics behave asymptotically as a linear combination of chi-squared random variables and their asymptotic distribution can be approximated by a gamma distribution. The proposed tests are modified to check for nonlinearity and to check the adequacy of a fitted nonlinear model. Simulation evidence indicates that the proposed goodness of fit tests tend to have higher power than other tests appearing in the literature, particularly in detecting long-memory nonlinear models. The efficacy of the proposed methods is demonstrated by investigating nonlinear effects in Apple, Inc., and Nikkei-300 daily returns during the 2006--2007 calendar years. The supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 777-787 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.688465 File-URL: http://hdl.handle.net/10.1080/01621459.2012.688465 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:777-787 Template-Type: ReDIF-Article 1.0 Author-Name: Christopher R. Genovese Author-X-Name-First: Christopher R. Author-X-Name-Last: Genovese Author-Name: Marco Perone-Pacifico Author-X-Name-First: Marco Author-X-Name-Last: Perone-Pacifico Author-Name: Isabella Verdinelli Author-X-Name-First: Isabella Author-X-Name-Last: Verdinelli Author-Name: Larry Wasserman Author-X-Name-First: Larry Author-X-Name-Last: Wasserman Title: The Geometry of Nonparametric Filament Estimation Abstract: We consider the problem of estimating filamentary structure from d-dimensional point process data. We make some connections with computational geometry and develop nonparametric methods for estimating the filaments. We show that, under weak conditions, the filaments have a simple geometric representation as the medial axis of the data distribution’s support. Our methods convert an estimator of the support’s boundary into an estimator of the filaments. We also find the rates of convergence of our estimators. Proofs of all results are in the supplementary material available online. Journal: Journal of the American Statistical Association Pages: 788-799 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.682527 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682527 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:788-799 Template-Type: ReDIF-Article 1.0 Author-Name: Sylvain Sardy Author-X-Name-First: Sylvain Author-X-Name-Last: Sardy Title: Smooth Blockwise Iterative Thresholding: A Smooth Fixed Point Estimator Based on the Likelihood’s Block Gradient Abstract: The proposed smooth blockwise iterative thresholding estimator (SBITE) is a model selection technique defined as a fixed point reached by iterating a likelihood gradient-based thresholding function. The smooth James--Stein thresholding function has two regularization parameters λ and ν, and a smoothness parameter s. It enjoys smoothness like ridge regression and selects variables like lasso. Focusing on Gaussian regression, we show that SBITE is uniquely defined, and that its Stein unbiased risk estimate is a smooth function of λ and ν, for better selection of the two regularization parameters. We perform a Monte Carlo simulation to investigate the predictive and oracle properties of this smooth version of adaptive lasso. 
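One member of this weighted portmanteau family is a weighted Ljung-Box statistic with linearly decaying weights w_k = (m - k + 1)/m, whose null distribution (a weighted sum of chi-squared variables) is approximated by a gamma distribution matched to the first two moments. A sketch along those lines, to be read as an approximation of the proposal rather than a verbatim implementation:

```python
import numpy as np
from scipy import stats

def weighted_ljung_box(x, m):
    """Weighted Ljung-Box statistic with weights (m - k + 1)/m and a
    two-moment gamma approximation to its weighted chi-squared limit."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n, ks = len(x), np.arange(1, m + 1)
    rho = np.array([np.sum(x[k:] * x[:-k]) for k in ks]) / np.sum(x * x)
    w = (m - ks + 1) / m
    q = n * (n + 2) * np.sum(w * rho ** 2 / (n - ks))
    mean, var = np.sum(w), 2 * np.sum(w ** 2)  # moments of sum_k w_k * chi2_1
    pval = stats.gamma.sf(q, a=mean ** 2 / var, scale=var / mean)
    return q, pval
```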
The motivation is a gravitational wave burst detection problem from several concomitant time series. A nonparametric wavelet-based estimator is developed to combine information from all detectors by block-thresholding multiresolution coefficients. We study how the smoothness parameter s tempers the erraticity of the risk estimate, and derive a universal threshold, an information criterion, and an oracle inequality in this canonical setting. Journal: Journal of the American Statistical Association Pages: 800-813 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.664527 File-URL: http://hdl.handle.net/10.1080/01621459.2012.664527 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:800-813 Template-Type: ReDIF-Article 1.0 Author-Name: Daniel Percival Author-X-Name-First: Daniel Author-X-Name-Last: Percival Title: Structured, Sparse Aggregation Abstract: This article introduces a method for aggregating many least-squares estimators so that the resulting estimate has two properties: sparsity and structure. That is, only a few candidate covariates are used in the resulting model, and the selected covariates follow some structure over the candidate covariates that is assumed to be known a priori. Although sparsity is well studied in many settings, including aggregation, structured sparse methods are still emerging. We demonstrate a general framework for structured sparse aggregation that allows for a wide variety of structures, including overlapping grouped structures and general structural penalties defined as set functions on the set of covariates. We show that such estimators satisfy structured sparse oracle inequalities—their finite sample risk adapts to the structured sparsity of the target. These inequalities reveal that under suitable settings, the structured sparse estimator performs at least as well as, and potentially much better than, a sparse aggregation estimator. We empirically establish the effectiveness of the method using simulation and an application to HIV drug resistance. Journal: Journal of the American Statistical Association Pages: 814-823 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.682542 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682542 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:814-823 Template-Type: ReDIF-Article 1.0 Author-Name: W. Brannath Author-X-Name-First: W. Author-X-Name-Last: Brannath Author-Name: G. Gutjahr Author-X-Name-First: G. Author-X-Name-Last: Gutjahr Author-Name: P. Bauer Author-X-Name-First: P. Author-X-Name-Last: Bauer Title: Probabilistic Foundation of Confirmatory Adaptive Designs Abstract: Adaptive designs allow the investigator of a confirmatory trial to react to unforeseen developments by changing the design. This broad flexibility comes at the price of a complex statistical model where important components, such as the adaptation rule, remain unspecified. It has thus been doubted whether Type I error control can be guaranteed in general adaptive designs. This criticism is fully justified as long as the probabilistic framework on which an adaptive design is based remains vague and implicit. Therefore, an indispensable step lies in the clarification of the probabilistic fundamentals of adaptive testing.
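The smooth James-Stein thresholding function at the heart of SBITE has the two regularization parameters λ and ν plus the smoothness parameter s; its exact form is given in the article. As a plainer point of reference, the sketch below applies the classical non-smooth blockwise James-Stein rule, scaling each block of coefficients by (1 - λ/||block||²)₊; it conveys only the block-thresholding mechanics, not Sardy's smooth fixed-point estimator.

```python
import numpy as np

def block_james_stein(coeffs, block_size, lam):
    """Classical blockwise James-Stein shrinkage: each block b is scaled by
    max(0, 1 - lam / ||b||^2); a smooth variant would replace max(0, .)
    by a smooth approximation controlled by a smoothness parameter s."""
    out = np.array(coeffs, dtype=float)
    for start in range(0, len(out), block_size):
        block = out[start:start + block_size]
        norm2 = float(np.sum(block ** 2))
        shrink = max(0.0, 1.0 - lam / norm2) if norm2 > 0 else 0.0
        out[start:start + block_size] = shrink * block
    return out
```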
We demonstrate that the two main principles of adaptive designs, namely the conditional Type I error rate and the conditional invariance principle, will provide Type I error rate control, if the conditional distribution of the second-stage data, given the first-stage data, can be described in terms of a regression model. A similar assumption is required for regression analysis where the distribution of the covariates is a nuisance parameter and the model needs to be identifiable independently from the covariate distribution. We further show that under the assumption of a regression model, the events of an arbitrary adaptive design can be embedded into a formal probability space without the need to impose any restrictions on the adaptation rule. As a consequence of our results, artificial constraints that had to be imposed on the investigator only for mathematical tractability of the model are no longer necessary. Journal: Journal of the American Statistical Association Pages: 824-832 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.682540 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682540 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:824-832 Template-Type: ReDIF-Article 1.0 Author-Name: Alberto Abadie Author-X-Name-First: Alberto Author-X-Name-Last: Abadie Author-Name: Guido W. Imbens Author-X-Name-First: Guido W. Author-X-Name-Last: Imbens Title: A Martingale Representation for Matching Estimators Abstract: Matching estimators are widely used in statistical data analysis. However, the large sample distribution of matching estimators has been derived only for particular cases. This article establishes a martingale representation for matching estimators. This representation allows the use of martingale limit theorems to derive the large sample distribution of matching estimators. As an illustration of the applicability of the theory, we derive the asymptotic distribution of a matching estimator when matching is carried out without replacement, a result previously unavailable in the literature. In addition, we apply the techniques proposed in this article to derive a correction to the standard error of a sample mean when missing data are imputed using the “hot deck,” a matching imputation method widely used in the Current Population Survey (CPS) and other large surveys in the social sciences. We demonstrate the empirical relevance of our methods using two Monte Carlo designs based on actual datasets. In these Monte Carlo exercises, the large sample distribution of matching estimators derived in this article provides an accurate approximation to the small sample behavior of these estimators. In addition, our simulations show that standard errors that do not take into account hot-deck imputation of missing data may be severely downward biased, while standard errors that incorporate the correction for hot-deck imputation perform extremely well. This article has online supplementary materials. Journal: Journal of the American Statistical Association Pages: 833-843 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.682537 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682537 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
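The conditional Type I error principle invoked above is easiest to see in a one-sided two-stage z-test: whatever the adaptation, the overall level is preserved as long as the stage-two test is carried out at the conditional level A(z1), the null probability that the preplanned fixed-sample test would reject given the first-stage statistic. A minimal sketch under the standard normal setup, with an assumed prespecified information fraction:

```python
from scipy.stats import norm

def conditional_error(z1, crit=1.96, info_frac=0.5):
    """Null probability that the preplanned one-sided z-test (critical
    value `crit`) rejects, given stage-one statistic z1 observed at
    information fraction `info_frac`. Uses Z = w1*Z1 + w2*Z2 with
    w1 = sqrt(t), w2 = sqrt(1 - t) and independent N(0, 1) increments."""
    w1, w2 = info_frac ** 0.5, (1.0 - info_frac) ** 0.5
    return norm.sf((crit - w1 * z1) / w2)

# Any redesigned stage-two test run at level conditional_error(z1)
# keeps the overall one-sided Type I error at 0.025.
print(round(conditional_error(1.0), 4))
```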
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:833-843 Template-Type: ReDIF-Article 1.0 Author-Name: Pierre Perron Author-X-Name-First: Pierre Author-X-Name-Last: Perron Author-Name: Tomoyoshi Yabu Author-X-Name-First: Tomoyoshi Author-X-Name-Last: Yabu Title: Testing for Trend in the Presence of Autoregressive Error: A Comment Journal: Journal of the American Statistical Association Pages: 844-844 Issue: 498 Volume: 107 Year: 2012 Month: 6 X-DOI: 10.1080/01621459.2012.668638 File-URL: http://hdl.handle.net/10.1080/01621459.2012.668638 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:844-844 Template-Type: ReDIF-Article 1.0 Author-Name: Ioanna Manolopoulou Author-X-Name-First: Ioanna Author-X-Name-Last: Manolopoulou Author-Name: Melanie P. Matheu Author-X-Name-First: Melanie P. Author-X-Name-Last: Matheu Author-Name: Michael D. Cahalan Author-X-Name-First: Michael D. Author-X-Name-Last: Cahalan Author-Name: Mike West Author-X-Name-First: Mike Author-X-Name-Last: West Author-Name: Thomas B. Kepler Author-X-Name-First: Thomas B. Author-X-Name-Last: Kepler Title: Bayesian Spatio-Dynamic Modeling in Cell Motility Studies: Learning Nonlinear Taxic Fields Guiding the Immune Response Abstract: We develop and analyze models of the spatio-temporal organization of lymphocytes in the lymph nodes and spleen. The spatial dynamics of these immune system white blood cells are influenced by biochemical fields and represent key components of the overall immune response to vaccines and infections. A primary goal is to learn about the structure of these fields that fundamentally shape the immune response. We define dynamic models of single-cell motion involving nonparametric representations of scalar potential fields underlying the directional biochemical fields that guide cellular motion. Bayesian hierarchical extensions define multicellular models for aggregating models and data on colonies of cells. Analysis via customized Markov chain Monte Carlo methods leads to Bayesian inference on cell-specific and population parameters together with the underlying spatial fields. Our case study explores data from multiphoton intravital microscopy in lymph nodes of mice, and we use a number of visualization tools to summarize and compare posterior inferences on the three-dimensional taxic fields. Journal: Journal of the American Statistical Association Pages: 855-865 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.655995 File-URL: http://hdl.handle.net/10.1080/01621459.2012.655995 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:855-865 Template-Type: ReDIF-Article 1.0 Author-Name: Laura A. Hatfield Author-X-Name-First: Laura A. Author-X-Name-Last: Hatfield Author-Name: Mark E. Boye Author-X-Name-First: Mark E. Author-X-Name-Last: Boye Author-Name: Michelle D. Hackshaw Author-X-Name-First: Michelle D. Author-X-Name-Last: Hackshaw Author-Name: Bradley P. Carlin Author-X-Name-First: Bradley P. Author-X-Name-Last: Carlin Title: Multilevel Bayesian Models for Survival Times and Longitudinal Patient-Reported Outcomes With Many Zeros Abstract: Regulatory approval of new therapies often depends on demonstrating prolonged survival. Particularly when these survival benefits are modest, consideration of therapeutic benefits to patient-reported outcomes (PROs) may add value to the traditional biomedical clinical trial endpoints. 
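To make the spatio-dynamic cell-motility modeling above concrete, here is the data-generating skeleton only: a cell drifts down the gradient of a scalar potential field and diffuses. The article places nonparametric priors on the field and infers it hierarchically by MCMC; the quadratic potential below is purely an assumption for illustration.

```python
import numpy as np

def simulate_cell(x0, grad_potential, sigma=0.1, dt=0.05, n_steps=200, seed=0):
    """Euler discretization of dX = -grad U(X) dt + sigma dW."""
    rng = np.random.default_rng(seed)
    path = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        x = path[-1]
        step = -grad_potential(x) * dt + sigma * rng.normal(0.0, np.sqrt(dt), size=x.shape)
        path.append(x + step)
    return np.array(path)

# assumed quadratic potential U(x) = ||x||^2 / 2, so grad U(x) = x
path = simulate_cell([2.0, -1.0], grad_potential=lambda x: x)
```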
We extend a popular class of joint models for longitudinal and survival data to accommodate the excessive zeros common in PROs, building hierarchical Bayesian models that combine information from longitudinal PRO measurements and survival outcomes. The model development is motivated by a clinical trial for malignant pleural mesothelioma, a rapidly fatal form of pulmonary cancer usually associated with asbestos exposure. By separately modeling the presence and severity of PROs, using our zero-augmented beta (ZAB) likelihood, we are able to model PROs on their original scale and learn about individual-level parameters from both presence and severity of symptoms. Correlations among an individual's PROs and survival are modeled using latent random variables, adjusting the fitted trajectories to better accommodate the observed data for each individual. This work contributes to understanding the impact of treatment on two aspects of mesothelioma: patients’ subjective experience of the disease process and their progression-free survival times. We uncover important differences between outcome types that are associated with therapy (periodic, worse in both treatment groups after therapy initiation) and those that are responsive to treatment (aperiodic, gradually widening gap between treatment groups). Finally, our work raises questions for future investigation into multivariate modeling, choice of link functions, and the relative contributions of multiple data sources in joint modeling contexts. Journal: Journal of the American Statistical Association Pages: 875-885 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.664517 File-URL: http://hdl.handle.net/10.1080/01621459.2012.664517 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:875-885 Template-Type: ReDIF-Article 1.0 Author-Name: Sally Picciotto Author-X-Name-First: Sally Author-X-Name-Last: Picciotto Author-Name: Miguel A. Hernán Author-X-Name-First: Miguel A. Author-X-Name-Last: Hernán Author-Name: John H. Page Author-X-Name-First: John H. Author-X-Name-Last: Page Author-Name: Jessica G. Young Author-X-Name-First: Jessica G. Author-X-Name-Last: Young Author-Name: James M. Robins Author-X-Name-First: James M. Author-X-Name-Last: Robins Title: Structural Nested Cumulative Failure Time Models to Estimate the Effects of Interventions Abstract: In the presence of time-varying confounders affected by prior treatment, standard statistical methods for failure time analysis may be biased. Methods that correctly adjust for this type of covariate include the parametric g-formula, inverse probability weighted estimation of marginal structural Cox proportional hazards models, and g-estimation of structural nested accelerated failure time models. In this article, we propose a novel method to estimate the causal effect of a time-dependent treatment on failure in the presence of informative right-censoring and time-dependent confounders that may be affected by past treatment: g-estimation of structural nested cumulative failure time models (SNCFTMs). An SNCFTM considers the conditional effect of a final treatment at time m on the outcome at each later time k by modeling the ratio of two counterfactual cumulative risks at time k under treatment regimes that differ only at time m. Inverse probability weights are used to adjust for informative censoring. 
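The zero-augmented beta (ZAB) likelihood described above treats a PRO score y in [0, 1) as a point mass at zero (symptom absent) mixed with a beta density for positive severity. A minimal per-observation log-likelihood, assuming a presence probability p and a mean/precision parameterization (mu, phi) of the beta component:

```python
import numpy as np
from scipy import stats

def zab_loglik(y, p, mu, phi):
    """Zero-augmented beta: P(Y = 0) = 1 - p and, given Y > 0,
    Y ~ Beta(mu * phi, (1 - mu) * phi)."""
    if y == 0:
        return np.log(1.0 - p)
    return np.log(p) + stats.beta.logpdf(y, mu * phi, (1.0 - mu) * phi)

print(zab_loglik(0.0, p=0.6, mu=0.3, phi=5.0))   # zero (presence) part
print(zab_loglik(0.25, p=0.6, mu=0.3, phi=5.0))  # severity part
```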
We also present a procedure that, under certain “no-interaction” conditions, uses the g-estimates of the model parameters to calculate unconditional cumulative risks under nondynamic (static) treatment regimes. The procedure is illustrated with an example using data from a longitudinal cohort study, in which the “treatments” are healthy behaviors and the outcome is coronary heart disease. Journal: Journal of the American Statistical Association Pages: 886-900 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.682532 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682532 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:886-900 Template-Type: ReDIF-Article 1.0 Author-Name: José R. Zubizarreta Author-X-Name-First: José R. Author-X-Name-Last: Zubizarreta Author-Name: Mark Neuman Author-X-Name-First: Mark Author-X-Name-Last: Neuman Author-Name: Jeffrey H. Silber Author-X-Name-First: Jeffrey H. Author-X-Name-Last: Silber Author-Name: Paul R. Rosenbaum Author-X-Name-First: Paul R. Author-X-Name-Last: Rosenbaum Title: Contrasting Evidence Within and Between Institutions That Provide Treatment in an Observational Study of Alternate Forms of Anesthesia Abstract: In a randomized trial, subjects are assigned to treatment or control by the flip of a fair coin. In many nonrandomized or observational studies, subjects find their way to treatment or control in two steps, either or both of which may lead to biased comparisons. By a vague process, perhaps affected by proximity or sociodemographic issues, subjects find their way to institutions that provide treatment. Once at such an institution, a second process, perhaps thoughtful and deliberate, assigns individuals to treatment or control. In the current article, the institutions are hospitals, and the treatment under study is the use of general anesthesia alone versus some use of regional anesthesia during surgery. For a specific operation, the use of regional anesthesia may be typical in one hospital and atypical in another. A new matched design is proposed for studies of this sort, one that creates two types of nonoverlapping matched pairs. Using a new extension of optimal matching with fine balance, pairs of the first type exactly balance treatment assignment across institutions, so each institution appears in the treated group with the same frequency that it appears in the control group; hence, differences between institutions that affect everyone in the same way cannot bias this comparison. Pairs of the second type compare institutions that assign most subjects to treatment and other institutions that assign most subjects to control, so each institution is represented in the treated group if it typically assigns subjects to treatment or, alternatively, in the control group if it typically assigns subjects to control, and no institution appears in both groups. By and large, in the second type of matched pair, subjects became treated subjects or controls by choosing an institution, not by a thoughtful and deliberate process of selecting subjects for treatment within institutions. The design provides two evidence factors, that is, two tests of the null hypothesis of no treatment effect that are independent when the null hypothesis is true, where each factor is largely unaffected by certain unmeasured biases that could readily invalidate the other factor. 
The two factors permit separate and combined sensitivity analyses, where the magnitude of bias affecting the two factors may differ. The case of knee surgery in the study of regional versus general anesthesia is considered in detail. Journal: Journal of the American Statistical Association Pages: 901-915 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.682533 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682533 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:901-915 Template-Type: ReDIF-Article 1.0 Author-Name: Qirong Ho Author-X-Name-First: Qirong Author-X-Name-Last: Ho Author-Name: Ankur P. Parikh Author-X-Name-First: Ankur P. Author-X-Name-Last: Parikh Author-Name: Eric P. Xing Author-X-Name-First: Eric P. Author-X-Name-Last: Xing Title: A Multiscale Community Blockmodel for Network Exploration Abstract: Real-world networks exhibit a complex set of phenomena such as underlying hierarchical organization, multiscale interaction, and varying topologies of communities. Most existing methods do not adequately capture the intrinsic interplay among such phenomena. We propose a nonparametric multiscale community blockmodel (MSCB) to model the generation of hierarchies in social communities, selective membership of actors to subsets of these communities, and the resultant networks due to within- and cross-community interactions. By using the nested Chinese restaurant process, our model automatically infers the hierarchy structure from the data. We develop a collapsed Gibbs sampling algorithm for posterior inference, conduct extensive validation using synthetic networks, and demonstrate the utility of our model in real-world datasets, such as predator--prey networks and citation networks. Journal: Journal of the American Statistical Association Pages: 916-934 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.682530 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682530 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:916-934 Template-Type: ReDIF-Article 1.0 Author-Name: Jianhua Hu Author-X-Name-First: Jianhua Author-X-Name-Last: Hu Author-Name: Xuming He Author-X-Name-First: Xuming Author-X-Name-Last: He Title: Searching for Alternative Splicing With a Joint Model on Probe Measurability and Expression Intensities Abstract: The exon tiling array offers a high throughput technology to search for aberrant splicing in biomedical research, but few methods of analysis for splicing detection have been tested both statistically and empirically. Noisy measurements on nonresponsive probe selection regions and outlying intensities at some of the samples tend to distort model-based assessments. We propose a robust analysis of variance approach that incorporates an informative model on probe measurability and uses median regression rank scores for better reliability in alternative splicing detection. We study the validity and effectiveness of our proposed approach in contrast with some of the existing methods through an empirical investigation of a brain cancer experiment, where a set of biologically validated genes for splicing and nonsplicing are available. Our study demonstrates favorable performance of the proposed ranking method, but shows that analysis of statistical significance cannot be trusted from any conventional use of p-values. 
We warn of any routine attempt to interpret p-values and their derivatives in model-based detection of alternative splicing. Journal: Journal of the American Statistical Association Pages: 935-945 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.682801 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682801 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:935-945 Template-Type: ReDIF-Article 1.0 Author-Name: Chiung-yu Huang Author-X-Name-First: Chiung-yu Author-X-Name-Last: Huang Author-Name: Jing Qin Author-X-Name-First: Jing Author-X-Name-Last: Qin Title: Composite Partial Likelihood Estimation Under Length-Biased Sampling, With Application to a Prevalent Cohort Study of Dementia Abstract: The Canadian Study of Health and Aging (CSHA) employed a prevalent cohort design to study survival after onset of dementia, where patients with dementia were sampled and the onset time of dementia was determined retrospectively. The prevalent cohort sampling scheme favors individuals who survive longer. Thus, the observed survival times are subject to length bias. In recent years, there has been a rising interest in developing estimation procedures for prevalent cohort survival data that not only account for length bias but also actually exploit the incidence distribution of the disease to improve efficiency. This article considers semiparametric estimation of the Cox model for the time from dementia onset to death under a stationarity assumption with respect to the disease incidence. Under the stationarity condition, the semiparametric maximum likelihood estimation is expected to be fully efficient yet difficult to perform for statistical practitioners, as the likelihood depends on the baseline hazard function in a complicated way. Moreover, the asymptotic properties of the semiparametric maximum likelihood estimator are not well-studied. Motivated by the composite likelihood method (Besag 1974), we develop a composite partial likelihood method that retains the simplicity of the popular partial likelihood estimator and can be easily performed using standard statistical software. When applied to the CSHA data, the proposed method estimates a significant difference in survival between the vascular dementia group and the possible Alzheimer's disease group, while the partial likelihood method for left-truncated and right-censored data yields a greater standard error and a 95% confidence interval covering 0, thus highlighting the practical value of employing a more efficient methodology. To check the assumption of stable disease for the CSHA data, we also present new graphical and numerical tests in the article. The R code used to obtain the maximum composite partial likelihood estimator for the CSHA data is available in the online Supplementary Material, posted on the journal web site. Journal: Journal of the American Statistical Association Pages: 946-957 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.682544 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682544 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:946-957 Template-Type: ReDIF-Article 1.0 Author-Name: Brent Kreider Author-X-Name-First: Brent Author-X-Name-Last: Kreider Author-Name: John V. Pepper Author-X-Name-First: John V. 
Author-X-Name-Last: Pepper Author-Name: Craig Gundersen Author-X-Name-First: Craig Author-X-Name-Last: Gundersen Author-Name: Dean Jolliffe Author-X-Name-First: Dean Author-X-Name-Last: Jolliffe Title: Identifying the Effects of SNAP (Food Stamps) on Child Health Outcomes When Participation Is Endogenous and Misreported Abstract: The literature assessing the efficacy of the Supplemental Nutrition Assistance Program (SNAP), formerly known as the Food Stamp Program, has long puzzled over positive associations between SNAP receipt and various undesirable health outcomes such as food insecurity. Assessing the causal impacts of SNAP, however, is hampered by two key identification problems: endogenous selection into participation and extensive systematic underreporting of participation status. Using data from the National Health and Nutrition Examination Survey (NHANES), we extend partial identification bounding methods to account for these two identification problems in a single unifying framework. Specifically, we derive informative bounds on the average treatment effect (ATE) of SNAP on child food insecurity, poor general health, obesity, and anemia across a range of different assumptions used to address the selection and classification error problems. In particular, to address the selection problem, we apply relatively weak nonparametric assumptions on the latent outcomes, selected treatments, and observed covariates. To address the classification error problem, we formalize a new approach that uses auxiliary administrative data on the size of the SNAP caseload to restrict the magnitudes and patterns of SNAP reporting errors. Layering successively stronger assumptions, an objective of our analysis is to make transparent how the strength of the conclusions varies with the strength of the identifying assumptions. Under the weakest restrictions, there is substantial ambiguity; we cannot rule out the possibility that SNAP increases or decreases poor health. Under stronger but plausible assumptions used to address the selection and classification error problems, we find that commonly cited relationships between SNAP and poor health outcomes provide a misleading picture about the true impacts of the program. Our tightest bounds identify favorable impacts of SNAP on child health. Journal: Journal of the American Statistical Association Pages: 958-975 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.682828 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682828 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:958-975 Template-Type: ReDIF-Article 1.0 Author-Name: María José García-Zattera Author-X-Name-First: María José Author-X-Name-Last: García-Zattera Author-Name: Alejandro Jara Author-X-Name-First: Alejandro Author-X-Name-Last: Jara Author-Name: Emmanuel Lesaffre Author-X-Name-First: Emmanuel Author-X-Name-Last: Lesaffre Author-Name: Guillermo Marshall Author-X-Name-First: Guillermo Author-X-Name-Last: Marshall Title: Modeling of Multivariate Monotone Disease Processes in the Presence of Misclassification Abstract: Motivated by a longitudinal oral health study, the Signal--Tandmobiel® study, we propose a multivariate binary inhomogeneous Markov model in which unobserved correlated response variables are subject to an unconstrained misclassification process and have a monotone behavior. 
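The starting point for the bounding analysis above is the classical worst-case (no-assumptions) bound on an average treatment effect with a binary outcome, in which each unobserved counterfactual mean is replaced by the logical extremes 0 and 1. The sketch below computes only those textbook bounds; the article then tightens them with monotonicity-type assumptions and uses auxiliary caseload data to bound misreporting, which is not reproduced here.

```python
import numpy as np

def worst_case_ate_bounds(y, d):
    """Manski-style no-assumptions bounds on ATE = E[Y(1)] - E[Y(0)] for a
    binary outcome y and an accurately measured binary treatment d."""
    y, d = np.asarray(y), np.asarray(d)
    p1 = d.mean()
    ey1, ey0 = y[d == 1].mean(), y[d == 0].mean()
    lower = ey1 * p1 - (ey0 * (1 - p1) + p1)      # missing counterfactuals at 0 / 1
    upper = ey1 * p1 + (1 - p1) - ey0 * (1 - p1)  # missing counterfactuals at 1 / 0
    return lower, upper  # width is always 1 without further assumptions
```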
The multivariate baseline distributions and Markov transition matrices of the unobserved processes are defined as a function of covariates through the specification of compatible full conditional distributions. Distinct misclassification models are discussed. In all cases, the possibility that different examiners were involved in the scoring of the responses of a given subject across time is taken into account. A full Bayesian implementation of the model is described and its performance is evaluated using simulated data. We provide theoretical and empirical evidence that the parameters can be estimated without any external information about the misclassification parameters. Finally, the analyses of the motivating study are presented. Appendices 1--7 are available in the online supplementary materials. Journal: Journal of the American Statistical Association Pages: 976-989 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.682804 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682804 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:976-989 Template-Type: ReDIF-Article 1.0 Author-Name: A. Adam Ding Author-X-Name-First: A. Adam Author-X-Name-Last: Ding Author-Name: Shaonan Tian Author-X-Name-First: Shaonan Author-X-Name-Last: Tian Author-Name: Yan Yu Author-X-Name-First: Yan Author-X-Name-Last: Yu Author-Name: Hui Guo Author-X-Name-First: Hui Author-X-Name-Last: Guo Title: A Class of Discrete Transformation Survival Models With Application to Default Probability Prediction Abstract: Corporate bankruptcy prediction plays a central role in academic finance research, business practice, and government regulation. Consequently, accurate default probability prediction is extremely important. We propose to apply a discrete transformation family of survival models to corporate default risk predictions. A class of Box-Cox transformations and logarithmic transformations is naturally adopted. The proposed transformation model family is shown to include the popular Shumway model and the grouped relative risk model. We show that a transformation parameter different from those two models is needed for default prediction using a bankruptcy dataset. In addition, we show using out-of-sample validation statistics that our model improves performance. We use the estimated default probability to examine a popular asset pricing question and determine whether default risk has carried a premium. Due to some distinct features of the bankruptcy application, the proposed class of discrete transformation survival models with time-varying covariates is different from the continuous survival models in the survival analysis literature. Their similarities and differences are discussed. Journal: Journal of the American Statistical Association Pages: 990-1003 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.682806 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682806 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
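Among the models nested by the discrete transformation family above is the Shumway-type dynamic logit, in which the conditional default hazard in each period is a logistic function of time-varying covariates; the transformation parameter then moves the model away from this logit special case. A sketch of the nested model only, with illustrative names:

```python
import numpy as np

def hazard_logit(x_t, beta, alpha_t):
    """Discrete-time hazard: P(default in period t | survival to t, x_t)."""
    return 1.0 / (1.0 + np.exp(-(alpha_t + x_t @ beta)))

def firm_loglik(X, events, beta, alpha):
    """One firm's contribution: X is (T, p) covariates over T periods,
    events is (T,) with a 1 in the default period, alpha is (T,) baseline."""
    h = np.array([hazard_logit(X[t], beta, alpha[t]) for t in range(len(X))])
    return np.sum(events * np.log(h) + (1 - events) * np.log(1.0 - h))
```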
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:990-1003 Template-Type: ReDIF-Article 1.0 Author-Name: Hai Nguyen Author-X-Name-First: Hai Author-X-Name-Last: Nguyen Author-Name: Noel Cressie Author-X-Name-First: Noel Author-X-Name-Last: Cressie Author-Name: Amy Braverman Author-X-Name-First: Amy Author-X-Name-Last: Braverman Title: Spatial Statistical Data Fusion for Remote Sensing Applications Abstract: Aerosols are tiny solid or liquid particles suspended in the atmosphere; examples of aerosols include windblown dust, sea salts, volcanic ash, smoke from wildfires, and pollution from factories. The global distribution of aerosols is a topic of great interest in climate studies since aerosols can either cool or warm the atmosphere depending on their location, type, and interaction with clouds. Aerosol concentrations are important input components of global climate models, and it is crucial to accurately estimate aerosol concentrations from remote sensing instruments so as to minimize errors “downstream” in climate models. Currently, space-based observations of aerosols are available from two remote sensing instruments on board NASA's Terra spacecraft: the Multiangle Imaging SpectroRadiometer (MISR) and the MODerate resolution Imaging Spectroradiometer (MODIS). These two instruments have complementary coverage, spatial support, and retrieval characteristics, making it advantageous to combine information from both sources to make optimal inferences about global aerosol distributions. In this article, we predict the true aerosol process from two noisy and possibly biased datasets, and we also quantify the uncertainties of these predictions. Our data-fusion methodology scales linearly and bears some resemblance to Fixed Rank Kriging (FRK), a variant of kriging that is designed for spatial interpolation of a single, massive dataset. Our spatial statistical approach does not require assumptions of stationarity or isotropy and, crucially, allows for change of spatial support. We compare our methodology to FRK and Bayesian melding, and we show that ours has superior prediction standard errors compared to FRK and much faster computational speed compared to Bayesian melding. Journal: Journal of the American Statistical Association Pages: 1004-1018 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.694717 File-URL: http://hdl.handle.net/10.1080/01621459.2012.694717 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1004-1018 Template-Type: ReDIF-Article 1.0 Author-Name: Qin Zhou Author-X-Name-First: Qin Author-X-Name-Last: Zhou Author-Name: Changliang Zou Author-X-Name-First: Changliang Author-X-Name-Last: Zou Author-Name: Zhaojun Wang Author-X-Name-First: Zhaojun Author-X-Name-Last: Wang Author-Name: Wei Jiang Author-X-Name-First: Wei Author-X-Name-Last: Jiang Title: Likelihood-Based EWMA Charts for Monitoring Poisson Count Data With Time-Varying Sample Sizes Abstract: Many applications involve monitoring incidence rates of the Poisson distribution when the sample size varies over time. Recently, a couple of cumulative sum and exponentially weighted moving average (EWMA) control charts have been proposed to tackle this problem by taking the varying sample size into consideration. However, we argue that some of these charts, which perform quite well in terms of average run length (ARL), may not be appealing in practice because they have rather unsatisfactory run length distributions.
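Stripped of the spatial covariance machinery, the fusion problem above reduces, at a single location with unbiased instruments, to precision-weighted averaging of two noisy measurements. The sketch below shows only that degenerate special case (the values and variances are made up); the article's contribution is the scalable spatial version with bias adjustment and change of support, which this does not attempt.

```python
def fuse_two_instruments(y1, var1, y2, var2):
    """Minimum-variance combination of two unbiased measurements of the
    same quantity: weights proportional to 1 / variance."""
    w1, w2 = 1.0 / var1, 1.0 / var2
    estimate = (w1 * y1 + w2 * y2) / (w1 + w2)
    return estimate, 1.0 / (w1 + w2)

# e.g., a MISR-like and a MODIS-like aerosol retrieval at one grid cell
print(fuse_two_instruments(0.21, 0.004, 0.26, 0.009))
```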
With some charts, the specified in-control (IC) ARL is attained with elevated probabilities of very short and very long runs, as compared with a geometric distribution. This is reflected in a larger run length standard deviation than that of a geometric distribution and an elevated probability of false alarms with short runs, which, in turn, hurt an operator's confidence in valid alarms. Furthermore, with many charts, the IC ARL exhibits considerable variations with different patterns of sample sizes. Under the framework of the weighted likelihood ratio test, this article suggests a new EWMA control chart that automatically integrates the varying sample sizes with the EWMA scheme. It is fast to compute, easy to construct, and quite efficient in detecting changes of Poisson rates. Two important features of the proposed method are that the IC run length distribution is similar to that of a geometric distribution and the IC ARL is robust to various patterns of sample size variation. Our simulation results show that the proposed chart is generally more effective and robust compared with existing EWMA charts. A health surveillance example based on mortality data from New Mexico is used to illustrate the implementation of the proposed method. This article has online supplementary materials. Journal: Journal of the American Statistical Association Pages: 1049-1062 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.682811 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682811 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1049-1062 Template-Type: ReDIF-Article 1.0 Author-Name: Anastasios Panagiotelis Author-X-Name-First: Anastasios Author-X-Name-Last: Panagiotelis Author-Name: Claudia Czado Author-X-Name-First: Claudia Author-X-Name-Last: Czado Author-Name: Harry Joe Author-X-Name-First: Harry Author-X-Name-Last: Joe Title: Pair Copula Constructions for Multivariate Discrete Data Abstract: Multivariate discrete response data can be found in diverse fields, including econometrics, finance, biometrics, and psychometrics. Our contribution, through this study, is to introduce a new class of models for multivariate discrete data based on pair copula constructions (PCCs) that has two major advantages. First, by deriving the conditions under which any multivariate discrete distribution can be decomposed as a PCC, we show that discrete PCCs attain highly flexible dependence structures. Second, the computational burden of evaluating the likelihood for an m-dimensional discrete PCC only grows quadratically with m. This compares favorably to existing models for which computing the likelihood requires either the evaluation of 2^m terms or slow numerical integration methods. We demonstrate the high quality of inference function for margins and maximum likelihood estimates, both under a simulated setting and for an application to a longitudinal discrete dataset on headache severity. This article has online supplementary material. Journal: Journal of the American Statistical Association Pages: 1063-1072 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.682850 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682850 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
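For discrete margins, the elementary building block of such a construction is a rectangle probability: the joint pmf of a pair is the finite difference of the copula at the marginal cdfs evaluated at y and y - 1. A sketch with a Clayton copula and Poisson margins, chosen only because both have closed forms (the PCC chains such terms through conditional distributions):

```python
from scipy import stats

def clayton(u, v, theta):
    """Clayton copula C(u, v), theta > 0, with the boundary convention."""
    if u <= 0.0 or v <= 0.0:
        return 0.0
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

def discrete_pair_pmf(y1, y2, theta, lam1, lam2):
    """P(Y1 = y1, Y2 = y2) by differencing the copula at Poisson cdfs."""
    u_hi, u_lo = stats.poisson.cdf([y1, y1 - 1], lam1)
    v_hi, v_lo = stats.poisson.cdf([y2, y2 - 1], lam2)
    return (clayton(u_hi, v_hi, theta) - clayton(u_lo, v_hi, theta)
            - clayton(u_hi, v_lo, theta) + clayton(u_lo, v_lo, theta))

print(discrete_pair_pmf(2, 3, theta=1.5, lam1=2.0, lam2=3.0))
```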
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1063-1072 Template-Type: ReDIF-Article 1.0 Author-Name: Jens-Peter Kreiss Author-X-Name-First: Jens-Peter Author-X-Name-Last: Kreiss Author-Name: Efstathios Paparoditis Author-X-Name-First: Efstathios Author-X-Name-Last: Paparoditis Title: The Hybrid Wild Bootstrap for Time Series Abstract: We introduce a new and simple bootstrap procedure for general linear processes, called the hybrid wild bootstrap. The hybrid wild bootstrap generates frequency domain replicates of the periodogram that imitate, in an asymptotically correct manner, the first- and second-order properties of the ordinary periodogram, including its weak dependence structure at different frequencies. As a consequence, the hybrid wild bootstrapped periodogram succeeds in approximating consistently the distribution of statistics that can be expressed as functionals of the periodogram, including the important class of spectral means for which all so far existing frequency domain bootstrap methods generally fail. Moreover, by inverting the hybrid wild bootstrapped discrete Fourier transform, pseudo-observations in the time domain are obtained. The generated time domain pseudo-observations can be used to approximate correctly the random behavior of statistics, the distribution of which depends on the first-, second-, and, to some extent, on the fourth-order structure of the underlying linear process. Thus, the proposed hybrid wild bootstrap procedure applied to general time series overcomes several of the limitations of standard linear time domain bootstrap methods. Journal: Journal of the American Statistical Association Pages: 1073-1084 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.695664 File-URL: http://hdl.handle.net/10.1080/01621459.2012.695664 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1073-1084 Template-Type: ReDIF-Article 1.0 Author-Name: Victor M. Panaretos Author-X-Name-First: Victor M. Author-X-Name-Last: Panaretos Author-Name: Kjell Konis Author-X-Name-First: Kjell Author-X-Name-Last: Konis Title: Nonparametric Construction of Multivariate Kernels Abstract: We propose a nonparametric method for constructing multivariate kernels tuned to the configuration of the sample, for density estimation in d-dimensional Euclidean space, with d moderate. The motivation behind the approach is to break down the construction of the kernel into two parts: determining its overall shape and then its global concentration. We consider a framework that is essentially nonparametric, as opposed to the usual bandwidth matrix parameterization. The shape of the kernel to be employed is determined by applying the backprojection operator, the dual of the Radon transform, to a collection of one-dimensional kernels, each optimally tuned to the concentration of the corresponding one-dimensional projections of the data. Once an overall shape is determined, the global concentration is controlled by a simple scaling. It is seen that the kernel estimators thus developed are easy and extremely fast to compute, and perform at least as well in practice as parametric kernels with cross-validated or otherwise tuned covariance structure. Connections with integral geometry are discussed, and the approach is illustrated under a wide range of scenarios in two and three dimensions, via an R package developed for its implementation.
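For contrast with the hybrid wild bootstrap above, the sketch below is the classical multiplicative frequency-domain bootstrap (Franke-Härdle style), which draws periodogram replicates as a spectral density estimate times iid standard exponential multipliers. That baseline reproduces the marginal behavior of periodogram ordinates for many ratio statistics but, as the abstract notes for this class of methods, generally fails for spectral means, which is precisely the gap the hybrid wild bootstrap closes.

```python
import numpy as np

def periodogram(x):
    """Ordinary periodogram at the positive Fourier frequencies."""
    n = len(x)
    freqs = 2.0 * np.pi * np.arange(1, n // 2 + 1) / n
    vals = np.abs(np.fft.fft(x - np.mean(x))[1:n // 2 + 1]) ** 2 / (2.0 * np.pi * n)
    return freqs, vals

def fdb_replicates(x, spec_hat, n_boot=500, seed=0):
    """Classical bootstrap: I*(w_j) = f_hat(w_j) * E_j, E_j iid Exp(1).
    (NOT the hybrid wild bootstrap; shown as the baseline it improves on.)"""
    rng = np.random.default_rng(seed)
    freqs, _ = periodogram(x)
    return freqs, spec_hat(freqs) * rng.exponential(1.0, size=(n_boot, len(freqs)))
```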
Journal: Journal of the American Statistical Association Pages: 1085-1095 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.695657 File-URL: http://hdl.handle.net/10.1080/01621459.2012.695657 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1085-1095 Template-Type: ReDIF-Article 1.0 Author-Name: Jiahua Chen Author-X-Name-First: Jiahua Author-X-Name-Last: Chen Author-Name: Pengfei Li Author-X-Name-First: Pengfei Author-X-Name-Last: Li Author-Name: Yuejiao Fu Author-X-Name-First: Yuejiao Author-X-Name-Last: Fu Title: Inference on the Order of a Normal Mixture Abstract: Finite normal mixture models are used in a wide range of applications. Hypothesis testing on the order of the normal mixture is an important yet unsolved problem. Existing procedures often lack a rigorous theoretical foundation. Many are also hard to implement numerically. In this article, we develop a new method to fill the void in this important area. An effective expectation-maximization (EM) test is invented for testing the null hypothesis of arbitrary order m_0 under a finite normal mixture model. For any positive integer m_0 ⩾ 2, the limiting distribution of the proposed test statistic is chi-squared with 2m_0 degrees of freedom. We also use a novel computer experiment to provide empirical formulas for the tuning parameter selection. The finite sample performance of the test is examined through simulation studies. Real-data examples are provided. The procedure has been implemented in R code. The p-values for testing the null order of m_0 = 2 or m_0 = 3 can be calculated with a single command. This article has supplementary materials available online. Journal: Journal of the American Statistical Association Pages: 1096-1105 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.695668 File-URL: http://hdl.handle.net/10.1080/01621459.2012.695668 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1096-1105 Template-Type: ReDIF-Article 1.0 Author-Name: Yingqi Zhao Author-X-Name-First: Yingqi Author-X-Name-Last: Zhao Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: A. John Rush Author-X-Name-First: A. John Author-X-Name-Last: Rush Author-Name: Michael R. Kosorok Author-X-Name-First: Michael R. Author-X-Name-Last: Kosorok Title: Estimating Individualized Treatment Rules Using Outcome Weighted Learning Abstract: There is increasing interest in discovering individualized treatment rules (ITRs) for patients who have heterogeneous responses to treatment. In particular, one aims to find an optimal ITR that is a deterministic function of patient-specific characteristics maximizing expected clinical outcome. In this article, we first show that estimating such an optimal treatment rule is equivalent to a classification problem where each subject is weighted proportional to his or her clinical outcome. We then propose an outcome weighted learning approach based on the support vector machine framework. We show that the resulting estimator of the treatment rule is consistent. We further obtain a finite sample bound for the difference between the expected outcome using the estimated ITR and that of the optimal treatment rule. The performance of the proposed approach is demonstrated via simulation studies and an analysis of chronic depression data.
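Since estimating the optimal ITR is recast above as an outcome-weighted classification problem, it can be sketched with any classifier that accepts case weights: label each subject by the treatment actually received and weight by outcome over the treatment-assignment probability. A minimal version with a weighted SVM, assuming a randomized trial with known propensity 0.5 and a simulated outcome:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n, p = 400, 3
X = rng.normal(size=(n, p))
A = rng.choice([-1, 1], size=n)                    # randomized treatment
# outcome is larger when the sign of A matches the sign of X[:, 0]
R = 1.0 + (A == np.sign(X[:, 0])) + rng.normal(scale=0.2, size=n)

# outcome-weighted learning: classify A from X with weights R / propensity
clf = SVC(kernel="rbf").fit(X, A, sample_weight=R / 0.5)
itr = clf.predict(X)  # estimated individualized treatment rule
```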
Journal: Journal of the American Statistical Association Pages: 1106-1118 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.695674 File-URL: http://hdl.handle.net/10.1080/01621459.2012.695674 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1106-1118 Template-Type: ReDIF-Article 1.0 Author-Name: Daniel L. Sussman Author-X-Name-First: Daniel L. Author-X-Name-Last: Sussman Author-Name: Minh Tang Author-X-Name-First: Minh Author-X-Name-Last: Tang Author-Name: Donniell E. Fishkind Author-X-Name-First: Donniell E. Author-X-Name-Last: Fishkind Author-Name: Carey E. Priebe Author-X-Name-First: Carey E. Author-X-Name-Last: Priebe Title: A Consistent Adjacency Spectral Embedding for Stochastic Blockmodel Graphs Abstract: We present a method to estimate block membership of nodes in a random graph generated by a stochastic blockmodel. We use an embedding procedure motivated by the random dot product graph model, a particular example of the latent position model. The embedding associates each node with a vector; these vectors are clustered via minimization of a square error criterion. We prove that this method is consistent for assigning nodes to blocks, as only a negligible number of nodes will be misassigned. We prove consistency of the method for directed and undirected graphs. The consistent block assignment makes possible consistent parameter estimation for a stochastic blockmodel. We extend the result in the setting where the number of blocks grows slowly with the number of nodes. Our method is also computationally feasible even for very large graphs. We compare our method with Laplacian spectral clustering through analysis of simulated data and a graph derived from Wikipedia documents. Journal: Journal of the American Statistical Association Pages: 1119-1128 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.699795 File-URL: http://hdl.handle.net/10.1080/01621459.2012.699795 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1119-1128 Template-Type: ReDIF-Article 1.0 Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Author-Name: Wei Zhong Author-X-Name-First: Wei Author-X-Name-Last: Zhong Author-Name: Liping Zhu Author-X-Name-First: Liping Author-X-Name-Last: Zhu Title: Feature Screening via Distance Correlation Learning Abstract: This article is concerned with screening features in ultrahigh-dimensional data analysis, which has become increasingly important in diverse scientific fields. We develop a sure independence screening procedure based on the distance correlation (DC-SIS). The DC-SIS can be implemented as easily as the sure independence screening (SIS) procedure based on the Pearson correlation proposed by Fan and Lv. However, the DC-SIS can significantly improve the SIS. Fan and Lv established the sure screening property for the SIS based on linear models, but the sure screening property is valid for the DC-SIS under more general settings, including linear models. Furthermore, the implementation of the DC-SIS does not require model specification (e.g., linear model or generalized linear model) for responses or predictors. This is a very appealing property in ultrahigh-dimensional data analysis. Moreover, the DC-SIS can be used directly to screen grouped predictor variables and multivariate response variables. 
We establish the sure screening property for the DC-SIS, and conduct simulations to examine its finite sample performance. A numerical comparison indicates that the DC-SIS performs much better than the SIS in various models. We also illustrate the DC-SIS through a real-data example. Journal: Journal of the American Statistical Association Pages: 1129-1139 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.695654 File-URL: http://hdl.handle.net/10.1080/01621459.2012.695654 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1129-1139 Template-Type: ReDIF-Article 1.0 Author-Name: Holger Dette Author-X-Name-First: Holger Author-X-Name-Last: Dette Author-Name: Matthias Trampisch Author-X-Name-First: Matthias Author-X-Name-Last: Trampisch Title: Optimal Designs for Quantile Regression Models Abstract: Despite their importance, optimal designs for quantile regression models have not been developed so far. In this article, we investigate the D-optimal design problem for nonlinear quantile regression analysis. We provide a necessary condition to check the optimality of a given design and use it to determine bounds for the number of support points of locally D-optimal designs. The results are illustrated by determining locally D-optimal, Bayesian D-optimal, and standardized maximin D-optimal designs for quantile regression analysis in the Michaelis--Menten and EMAX models, which are widely used in such important fields as toxicology, pharmacokinetics, and dose--response modeling. Journal: Journal of the American Statistical Association Pages: 1140-1151 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.695665 File-URL: http://hdl.handle.net/10.1080/01621459.2012.695665 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1140-1151 Template-Type: ReDIF-Article 1.0 Author-Name: Sebastian Kurtek Author-X-Name-First: Sebastian Author-X-Name-Last: Kurtek Author-Name: Anuj Srivastava Author-X-Name-First: Anuj Author-X-Name-Last: Srivastava Author-Name: Eric Klassen Author-X-Name-First: Eric Author-X-Name-Last: Klassen Author-Name: Zhaohua Ding Author-X-Name-First: Zhaohua Author-X-Name-Last: Ding Title: Statistical Modeling of Curves Using Shapes and Related Features Abstract: Motivated by the problems of analyzing protein backbones, diffusion tensor magnetic resonance imaging (DT-MRI) fiber tracts in the human brain, and other problems involving curves, in this study we present some statistical models of parameterized curves in Euclidean space, in terms of combinations of features such as shape, location, scale, and orientation. For each combination of interest, we identify a representation manifold, endow it with a Riemannian metric, and outline tools for computing sample statistics on these manifolds. An important characteristic of the chosen representations is that the ensuing comparison and modeling of curves is invariant to how the curves are parameterized. The nuisance variables, including parameterization, are removed by forming quotient spaces under appropriate group actions. In the case of shape analysis, the resulting spaces are quotient spaces of Hilbert spheres, and we derive certain wrapped truncated normal densities for capturing variability in observed curves.
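A compact version of the distance correlation underlying the DC-SIS procedure above: double-center the pairwise distance matrix of each predictor and of the response, and screen predictors by the resulting correlation. This is the plain O(n²) sample estimator of Székely and Rizzo, with illustrative function names:

```python
import numpy as np

def _centered_dist(a):
    """Double-centered Euclidean distance matrix of a sample."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    d = np.sqrt(((a[:, None, :] - a[None, :, :]) ** 2).sum(axis=-1))
    return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

def dcor(x, y):
    """Sample distance correlation between x and y."""
    A, B = _centered_dist(x), _centered_dist(y)
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt((A * B).mean() / denom) if denom > 0 else 0.0

def dc_sis(X, y, top_k):
    """Keep the top_k predictors by distance correlation with the response."""
    scores = np.array([dcor(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:top_k], scores
```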
We demonstrate these models using both artificial data and real data involving DT-MRI fiber tracts from multiple subjects and protein backbones from the Shape Retrieval Contest of Non-rigid 3D Models (SHREC) 2010 database. Journal: Journal of the American Statistical Association Pages: 1152-1165 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.699770 File-URL: http://hdl.handle.net/10.1080/01621459.2012.699770 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1152-1165 Template-Type: ReDIF-Article 1.0 Author-Name: Raymond Carroll Author-X-Name-First: Raymond Author-X-Name-Last: Carroll Author-Name: Aurore Delaigle Author-X-Name-First: Aurore Author-X-Name-Last: Delaigle Author-Name: Peter Hall Author-X-Name-First: Peter Author-X-Name-Last: Hall Title: Deconvolution When Classifying Noisy Data Involving Transformations Abstract: In the present study, we consider the problem of classifying spatial data distorted by a linear transformation or convolution and contaminated by additive random noise. In this setting, we show that classifier performance can be improved if we carefully invert the data before the classifier is applied. However, the inverse transformation is not constructed so as to recover the original signal, and in fact, we show that taking the latter approach is generally inadvisable. We introduce a fully data-driven procedure based on cross-validation, and use several classifiers to illustrate numerical properties of our approach. Theoretical arguments are given in support of our claims. Our procedure is applied to data generated by light detection and ranging (Lidar) technology, where we improve on earlier approaches to classifying aerosols. This article has supplementary materials online. Journal: Journal of the American Statistical Association Pages: 1166-1177 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.699793 File-URL: http://hdl.handle.net/10.1080/01621459.2012.699793 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1166-1177 Template-Type: ReDIF-Article 1.0 Author-Name: Mike Danilov Author-X-Name-First: Mike Author-X-Name-Last: Danilov Author-Name: Víctor J. Yohai Author-X-Name-First: Víctor J. Author-X-Name-Last: Yohai Author-Name: Ruben H. Zamar Author-X-Name-First: Ruben H. Author-X-Name-Last: Zamar Title: Robust Estimation of Multivariate Location and Scatter in the Presence of Missing Data Abstract: Two main issues regarding data quality are data contamination (outliers) and data completion (missing data). These two problems have attracted much attention and research but surprisingly, they are seldom considered together. Popular robust methods such as S-estimators of multivariate location and scatter offer protection against outliers but cannot deal with missing data, except for the obviously inefficient approach of deleting all incomplete cases. We generalize the definition of S-estimators of multivariate location and scatter to simultaneously deal with missing data and outliers. We show that the proposed estimators are strongly consistent under elliptical models when data are missing completely at random. We derive an algorithm similar to the Expectation-Maximization algorithm for computing the proposed estimators. This algorithm is initialized by an extension for missing data of the minimum volume ellipsoid. 
We assess the performance of our proposal by Monte Carlo simulation and give some real data examples. This article has supplementary material online. Journal: Journal of the American Statistical Association Pages: 1178-1186 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.699792 File-URL: http://hdl.handle.net/10.1080/01621459.2012.699792 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1178-1186 Template-Type: ReDIF-Article 1.0 Author-Name: Chenlei Leng Author-X-Name-First: Chenlei Author-X-Name-Last: Leng Author-Name: Cheng Yong Tang Author-X-Name-First: Cheng Yong Author-X-Name-Last: Tang Title: Sparse Matrix Graphical Models Abstract: Matrix-variate observations are frequently encountered in many contemporary statistical problems due to a rising need to organize and analyze data with structured information. In this article, we propose a novel sparse matrix graphical model for these types of statistical problems. By penalizing, respectively, two precision matrices corresponding to the rows and columns, our method yields a sparse matrix graphical model that synthetically characterizes the underlying conditional independence structure. Our model is more parsimonious and is practically more interpretable than the conventional sparse vector-variate graphical models. Asymptotic analysis shows that our penalized likelihood estimates enjoy better convergence rates than those of the vector-variate graphical model. The finite sample performance of the proposed method is illustrated via extensive simulation studies and analyses of several real datasets. Journal: Journal of the American Statistical Association Pages: 1187-1200 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.706133 File-URL: http://hdl.handle.net/10.1080/01621459.2012.706133 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1187-1200 Template-Type: ReDIF-Article 1.0 Author-Name: Gabriel Chandler Author-X-Name-First: Gabriel Author-X-Name-Last: Chandler Author-Name: Wolfgang Polonik Author-X-Name-First: Wolfgang Author-X-Name-Last: Polonik Title: Mode Identification of Volatility in Time-Varying Autoregression Abstract: In many applications, time series exhibit nonstationary behavior that might reasonably be modeled as a time-varying autoregressive (AR) process. In the context of such a model, we discuss the problem of testing for modality of the variance function. We propose a test of modality that is local and, when used iteratively, can be used to identify the total number of modes in a given series. This problem is closely related to peak detection and identification, which has applications in many fields. We propose a test that, under appropriate assumptions, is asymptotically distribution free under the null hypothesis, even though nonparametric estimation of the AR parameter functions is involved. Simulation studies and applications to real datasets illustrate the behavior of the test. Journal: Journal of the American Statistical Association Pages: 1217-1229 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.703877 File-URL: http://hdl.handle.net/10.1080/01621459.2012.703877 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1217-1229 Template-Type: ReDIF-Article 1.0 Author-Name: Shurong Zheng Author-X-Name-First: Shurong Author-X-Name-Last: Zheng Author-Name: Ning-Zhong Shi Author-X-Name-First: Ning-Zhong Author-X-Name-Last: Shi Author-Name: Zhengjun Zhang Author-X-Name-First: Zhengjun Author-X-Name-Last: Zhang Title: Generalized Measures of Correlation for Asymmetry, Nonlinearity, and Beyond Abstract: Applicability of Pearson's correlation as a measure of explained variance is by now well understood. One of its limitations is that it does not account for asymmetry in explained variance. Aiming to develop broadly applicable correlation measures, we study a pair of generalized measures of correlation (GMC) that deal with asymmetries in explained variances, and linear or nonlinear relations between random variables. We present examples under which the paired measures are identical, and they become a symmetric correlation measure that is the same as the squared Pearson's correlation coefficient. As a result, Pearson's correlation is a special case of GMC. Theoretical properties of GMC show that it can be applied in numerous settings and can lead to more meaningful conclusions and improved decision making. In statistical inference, the joint asymptotics of the kernel-based estimators for GMC are derived and are used to test whether or not two random variables are symmetric in explaining variances. The testing results give important guidance in practical model selection problems. The efficiency of the test statistics is illustrated in simulation examples. In real-data analysis, we present an important application of GMC in explained variances and market movements among three major economic and financial monetary indicators. This article has online supplementary materials. Journal: Journal of the American Statistical Association Pages: 1239-1252 Issue: 499 Volume: 107 Year: 2012 Month: 9 X-DOI: 10.1080/01621459.2012.710509 File-URL: http://hdl.handle.net/10.1080/01621459.2012.710509 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1239-1252 Template-Type: ReDIF-Article 1.0 Author-Name: William Astle Author-X-Name-First: William Author-X-Name-Last: Astle Author-Name: Maria De Iorio Author-X-Name-First: Maria Author-X-Name-Last: De Iorio Author-Name: Sylvia Richardson Author-X-Name-First: Sylvia Author-X-Name-Last: Richardson Author-Name: David Stephens Author-X-Name-First: David Author-X-Name-Last: Stephens Author-Name: Timothy Ebbels Author-X-Name-First: Timothy Author-X-Name-Last: Ebbels Title: A Bayesian Model of NMR Spectra for the Deconvolution and Quantification of Metabolites in Complex Biological Mixtures Abstract: Nuclear magnetic resonance (NMR) spectra are widely used in metabolomics to obtain profiles of metabolites dissolved in biofluids such as cell supernatants. Methods for estimating metabolite concentrations from these spectra are presently confined to manual peak fitting and to binning procedures for integrating resonance peaks. Extensive information on the patterns of spectral resonance generated by human metabolites is now available in online databases. By incorporating this information into a Bayesian model, we can deconvolve resonance peaks from a spectrum and obtain explicit concentration estimates for the corresponding metabolites.
Spectral resonances that cannot be deconvolved in this way may also be of scientific interest, so we model them jointly using wavelets. We describe a Markov chain Monte Carlo algorithm that allows us to sample from the joint posterior distribution of the model parameters, using specifically designed block updates to improve mixing. The strong prior on resonance patterns allows the algorithm to identify peaks corresponding to particular metabolites automatically, eliminating the need for manual peak assignment. We assess our method for peak alignment and concentration estimation. Except in cases when the target resonance signal is very weak, alignment is unbiased and precise. We compare the Bayesian concentration estimates with those obtained from a conventional numerical integration method and find that our point estimates have six-fold lower mean squared error. Finally, we apply our method to a spectral dataset taken from an investigation of the metabolic response of yeast to recombinant protein expression. We estimate the concentrations of 26 metabolites and compare with manual quantification by five expert spectroscopists. We discuss the reasons for discrepancies and the robustness of our method's concentration estimates. This article has supplementary materials online. Journal: Journal of the American Statistical Association Pages: 1259-1271 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.695661 File-URL: http://hdl.handle.net/10.1080/01621459.2012.695661 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1259-1271 Template-Type: ReDIF-Article 1.0 Author-Name: Yuan Wang Author-X-Name-First: Yuan Author-X-Name-Last: Wang Author-Name: J. S. Marron Author-X-Name-First: J. S. Author-X-Name-Last: Marron Author-Name: Burcu Aydin Author-X-Name-First: Burcu Author-X-Name-Last: Aydin Author-Name: Alim Ladha Author-X-Name-First: Alim Author-X-Name-Last: Ladha Author-Name: Elizabeth Bullitt Author-X-Name-First: Elizabeth Author-X-Name-Last: Bullitt Author-Name: Haonan Wang Author-X-Name-First: Haonan Author-X-Name-Last: Wang Title: A Nonparametric Regression Model With Tree-Structured Response Abstract: Developments in science and technology over the last two decades have motivated the study of complex data objects. In this article, we consider the topological properties of a population of tree-structured objects. Our interest centers on modeling the relationship between a tree-structured response and other covariates. For tree-structured objects, this poses serious challenges since most regression methods rely on linear operations in Euclidean space. We generalize the notion of nonparametric regression to the case of a tree-structured response variable. In addition, we develop a fast algorithm and give its theoretical justification. We implement the proposed method to analyze a dataset of human brain artery trees. An important lesson is that smoothing in the full tree space can reveal much deeper scientific insights than the simple smoothing of summary statistics. This article has supplementary materials online. Journal: Journal of the American Statistical Association Pages: 1272-1285 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.699348 File-URL: http://hdl.handle.net/10.1080/01621459.2012.699348 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1272-1285 Template-Type: ReDIF-Article 1.0 Author-Name: Manuel Wiesenfarth Author-X-Name-First: Manuel Author-X-Name-Last: Wiesenfarth Author-Name: Tatyana Krivobokova Author-X-Name-First: Tatyana Author-X-Name-Last: Krivobokova Author-Name: Stephan Klasen Author-X-Name-First: Stephan Author-X-Name-Last: Klasen Author-Name: Stefan Sperlich Author-X-Name-First: Stefan Author-X-Name-Last: Sperlich Title: Direct Simultaneous Inference in Additive Models and Its Application to Model Undernutrition Abstract: This article proposes a simple and fast approach to build simultaneous confidence bands and perform specification tests for smooth curves in additive models. The method allows for handling of spatially heterogeneous functions and their derivatives as well as heteroscedasticity in the data. It is applied to study the determinants of chronic undernutrition of Kenyan children, with a particular focus on the highly nonlinear age pattern in undernutrition. Model estimation using the mixed model representation of penalized splines in combination with simultaneous probability calculations based on the volume-of-tube formula enables simultaneous inference directly, that is, without resampling methods. Finite sample properties of simultaneous confidence bands and specification tests are investigated in simulations. To facilitate and enhance its application, the method has been implemented in the R package AdaptFitOS. Journal: Journal of the American Statistical Association Pages: 1286-1296 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.682809 File-URL: http://hdl.handle.net/10.1080/01621459.2012.682809 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1286-1296 Template-Type: ReDIF-Article 1.0 Author-Name: Martin A. Lindquist Author-X-Name-First: Martin A. Author-X-Name-Last: Lindquist Title: Functional Causal Mediation Analysis With an Application to Brain Connectivity Abstract: Mediation analysis is often used in the behavioral sciences to investigate the role of intermediate variables that lie on the causal path between a randomized treatment and an outcome variable. Typically, mediation is assessed using structural equation models (SEMs), with model coefficients interpreted as causal effects. In this article, we present an extension of SEMs to the functional data analysis (FDA) setting that allows the mediating variable to be a continuous function rather than a single scalar measure, thus providing the opportunity to study the functional effects of the mediator on the outcome. We provide sufficient conditions for identifying the average causal effects of the functional mediators using the extended SEM, as well as weaker conditions under which an instrumental variable estimand may be interpreted as an effect. The method is applied to data from a functional magnetic resonance imaging (fMRI) study of thermal pain that sought to determine whether activation in certain brain regions mediated the effect of applied temperature on self-reported pain. Our approach provides valuable information about the timing of the mediating effect that is not readily available when using the standard nonfunctional approach. To the best of our knowledge, this work provides the first application of causal inference to the FDA framework.
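The functional mediation model just summarized lends itself to a compact numerical illustration. The sketch below is not Lindquist's estimator; it is a minimal discretized version under assumed simulation settings, with all names (a_true, b_true, gamma_true, the Fourier basis size K) invented for the example: the treatment-to-mediator path a(t) is fit pointwise, the mediator-to-outcome weight function b(t) is expanded in a small basis, and the indirect effect is the integral of a(t)b(t).

# Hypothetical sketch of a functional mediation fit in the spirit of the
# SEM extension described above; not the article's exact estimator.
import numpy as np

rng = np.random.default_rng(0)
n, T = 300, 50
t = np.linspace(0, 1, T)
w = np.full(T, 1.0 / T)                 # quadrature weights for integrals

# True paths (assumed for simulation only)
a_true = np.sin(np.pi * t)              # treatment -> mediator curve
b_true = np.exp(-2 * t)                 # mediator curve -> outcome
gamma_true = 0.5                        # direct effect

X = rng.normal(size=n)
M = np.outer(X, a_true) + rng.normal(scale=0.3, size=(n, T))
Y = gamma_true * X + M @ (w * b_true) + rng.normal(scale=0.3, size=n)

# Stage 1: pointwise OLS of M(t_j) on X gives a_hat(t)
a_hat = (X @ M) / (X @ X)

# Stage 2: expand b(t) in a small Fourier basis and regress Y on [X, Z]
K = 5
Phi = np.column_stack([np.ones(T)] + [f(k * np.pi * t)
      for k in range(1, (K + 1) // 2 + 1) for f in (np.cos, np.sin)])[:, :K]
Z = M @ (w[:, None] * Phi)              # Z_ik = sum_j w_j M_ij phi_k(t_j)
D = np.column_stack([X, Z])
coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
gamma_hat, b_hat = coef[0], Phi @ coef[1:]

# Average indirect (mediated) effect: integral of a(t) b(t) dt
indirect = np.sum(w * a_hat * b_hat)
print(f"direct {gamma_hat:.2f}, indirect {indirect:.2f} "
      f"(truth {np.sum(w * a_true * b_true):.2f})")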
Journal: Journal of the American Statistical Association Pages: 1297-1309 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.695640 File-URL: http://hdl.handle.net/10.1080/01621459.2012.695640 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1297-1309 Template-Type: ReDIF-Article 1.0 Author-Name: Sandra M. Mohammed Author-X-Name-First: Sandra M. Author-X-Name-Last: Mohammed Author-Name: Damla Şentürk Author-X-Name-First: Damla Author-X-Name-Last: Şentürk Author-Name: Lorien S. Dalrymple Author-X-Name-First: Lorien S. Author-X-Name-Last: Dalrymple Author-Name: Danh V. Nguyen Author-X-Name-First: Danh V. Author-X-Name-Last: Nguyen Title: Measurement Error Case Series Models With Application to Infection-Cardiovascular Risk in Older Patients on Dialysis Abstract: Infection and cardiovascular disease are leading causes of hospitalization and death in older patients on dialysis. Our recent work found an increase in the relative incidence of cardiovascular outcomes during the ∼ 30 days after infection-related hospitalizations using the case series model, which adjusts for measured and unmeasured baseline confounders. However, a major challenge in modeling/assessing the infection-cardiovascular risk hypothesis is that the exact times of infection, or more generally “exposure,” onsets cannot be ascertained from hospitalization data. Only imprecise markers of the timing of infection onsets are available. Although there is a large literature on measurement error in the predictors in regression modeling, to date there has been, to our knowledge, no work on measurement error in the timing of a time-varying exposure. Thus, we propose a new class of models, the measurement error case series (MECS) models, to account for measurement error in time-varying exposure onsets. We characterize the general nature of the bias resulting from estimation that ignores measurement error and propose a bias-corrected estimation procedure for the MECS models. We examine in detail the accuracy of the proposed method to estimate the relative incidence of cardiovascular events. Hospitalization data from the United States Renal Data System, which captures nearly all (>99%) patients with end-stage renal disease in the United States over time, are used to illustrate the proposed method. The results suggest that the estimate of the relative incidence of cardiovascular events during the 30 days after infections, a period where acute effects of infection on vascular endothelium may be most pronounced, is substantially attenuated in the presence of infection onset measurement error. Journal: Journal of the American Statistical Association Pages: 1310-1323 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.695648 File-URL: http://hdl.handle.net/10.1080/01621459.2012.695648 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1310-1323 Template-Type: ReDIF-Article 1.0 Author-Name: Yuanjia Wang Author-X-Name-First: Yuanjia Author-X-Name-Last: Wang Author-Name: Tanya P. Garcia Author-X-Name-First: Tanya P.
Author-X-Name-Last: Garcia Author-Name: Yanyuan Ma Author-X-Name-First: Yanyuan Author-X-Name-Last: Ma Title: Nonparametric Estimation for Censored Mixture Data With Application to the Cooperative Huntington’s Observational Research Trial Abstract: This work presents methods for estimating genotype-specific outcome distributions from genetic epidemiology studies where the event times are subject to right censoring, the genotypes are not directly observed, and the data arise from a mixture of scientifically meaningful subpopulations. Examples of such studies include kin-cohort studies and quantitative trait locus (QTL) studies. Current methods for analyzing censored mixture data include two types of nonparametric maximum likelihood estimators (NPMLEs; Type I and Type II) that do not make parametric assumptions on the genotype-specific density functions. Although both NPMLEs are commonly used, we show that one is inefficient and the other inconsistent. To overcome these deficiencies, we propose three classes of consistent nonparametric estimators that do not assume parametric density models and are easy to implement. They are based on inverse probability weighting (IPW), augmented IPW (AIPW), and nonparametric imputation (IMP). AIPW achieves the efficiency bound without additional modeling assumptions. Extensive simulation experiments demonstrate satisfactory performance of these estimators even when the data are heavily censored. We apply these estimators to the Cooperative Huntington’s Observational Research Trial (COHORT), and provide age-specific estimates of the effect of mutation in the Huntington gene on mortality using a sample of family members. The close approximation of the estimated noncarrier survival rates to those of the U.S. population indicates small ascertainment bias in the COHORT family sample. Our analyses underscore an elevated risk of death in Huntington gene mutation carriers compared with that in noncarriers for a wide age range, and suggest that the mutation affects survival rates equally in both genders. The estimated survival rates are useful in genetic counseling for providing guidelines on interpreting the risk of death associated with a positive genetic test, and in helping future subjects at risk to make informed decisions on whether to undergo genetic mutation testing. Technical details and additional numerical results are provided in the online supplementary materials. Journal: Journal of the American Statistical Association Pages: 1324-1338 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.699353 File-URL: http://hdl.handle.net/10.1080/01621459.2012.699353 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1324-1338 Template-Type: ReDIF-Article 1.0 Author-Name: Hsiu-Hsi Chen Author-X-Name-First: Hsiu-Hsi Author-X-Name-Last: Chen Author-Name: Amy Ming-Fang Yen Author-X-Name-First: Amy Ming-Fang Author-X-Name-Last: Yen Author-Name: Laszlo Tabár Author-X-Name-First: Laszlo Author-X-Name-Last: Tabár Title: A Stochastic Model for Calibrating the Survival Benefit of Screen-Detected Cancers Abstract: Comparison of the survival of clinically detected and screen-detected cancer cases from either population-based service screening programs or opportunistic screening is often distorted by both lead-time and length biases. Both are correlated with each other and are also affected by measurement errors and tumor attributes such as regional lymph node spread.
We propose a general stochastic approach to calibrate the survival benefit of screen-detected cancers with respect to the two biases, measurement errors, and tumor attributes. We apply our proposed method to breast cancer screening data from one arm of the Swedish Two-County trial in the trial period together with the subsequent service screening for the same cohort. When there is no calibration, the results—assuming a constant (exponentially distributed) post-lead-time hazard rate (i.e., a homogeneous stochastic process)—show a 57% reduction in breast cancer death over 25 years. After correction, the reduction was 30%, with approximately 12% of the overestimation being due to lead-time bias and 15% due to length bias. The additional impacts of measurement errors (sensitivity and specificity) depend on the type of proposed model and on follow-up time. The corresponding analysis when the Weibull distribution was applied—relaxing the assumption of a constant hazard rate—yielded similar findings and did not differ significantly from the exponential model. The proposed calibration approach allows the benefit of a service cancer screening program to be fairly evaluated. This article has supplementary materials online. Journal: Journal of the American Statistical Association Pages: 1339-1359 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.716335 File-URL: http://hdl.handle.net/10.1080/01621459.2012.716335 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1339-1359 Template-Type: ReDIF-Article 1.0 Author-Name: José R. Zubizarreta Author-X-Name-First: José R. Author-X-Name-Last: Zubizarreta Title: Using Mixed Integer Programming for Matching in an Observational Study of Kidney Failure After Surgery Abstract: This article presents a new method for optimal matching in observational studies based on mixed integer programming. Unlike widely used matching methods based on network algorithms, which attempt to achieve covariate balance by minimizing the total sum of distances between treated units and matched controls, this new method achieves covariate balance directly, either by minimizing both the total sum of distances and a weighted sum of specific measures of covariate imbalance, or by minimizing the total sum of distances while constraining the measures of imbalance to be less than or equal to certain tolerances. The inclusion of these extra terms in the objective function or the use of these additional constraints explicitly optimizes or constrains the criteria that will be used to evaluate the quality of the match. For example, the method minimizes or constrains differences in univariate moments, such as means, variances, and skewness; differences in multivariate moments, such as correlations between covariates; differences in quantiles; and differences in statistics, such as the Kolmogorov--Smirnov statistic, to minimize the differences in both location and shape of the empirical distributions of the treated units and matched controls. While balancing several of these measures, it is also possible to impose constraints for exact and near-exact matching, and fine and near-fine balance for more than one nominal covariate, whereas network algorithms can finely or near-finely balance only a single nominal covariate.
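To make the preceding formulation concrete, here is a minimal sketch (Python with the PuLP modeling library) of a matching problem that minimizes total covariate distance subject to an explicit mean-balance constraint. It illustrates the general idea only, not the mipmatch package; the single covariate, the tolerance tol, and all variable names are assumptions of the example.

# Illustrative balanced-matching MIP; requires: pip install pulp
import numpy as np
import pulp

rng = np.random.default_rng(1)
nt, nc = 5, 15
xt = rng.normal(0.5, 1, nt)            # covariate, treated units
xc = rng.normal(0.0, 1, nc)            # covariate, control pool
d = np.abs(xt[:, None] - xc[None, :])  # pairwise distances

prob = pulp.LpProblem("balanced_match", pulp.LpMinimize)
pairs = [(i, j) for i in range(nt) for j in range(nc)]
m = pulp.LpVariable.dicts("m", pairs, cat="Binary")

# Objective: total distance between treated units and matched controls
prob += pulp.lpSum(d[i, j] * m[(i, j)] for i, j in pairs)

# Each treated unit gets exactly one control; controls used at most once
for i in range(nt):
    prob += pulp.lpSum(m[(i, j)] for j in range(nc)) == 1
for j in range(nc):
    prob += pulp.lpSum(m[(i, j)] for i in range(nt)) <= 1

# Balance constraint: matched-control mean within tol of treated mean
tol = 0.1
mc_sum = pulp.lpSum(xc[j] * m[(i, j)] for i, j in pairs)
prob += mc_sum <= nt * (xt.mean() + tol)
prob += mc_sum >= nt * (xt.mean() - tol)

prob.solve(pulp.PULP_CBC_CMD(msg=0))
print("status:", pulp.LpStatus[prob.status])
print("matched pairs:", [(i, j) for i, j in pairs if m[(i, j)].value() > 0.5])

Tightening tol trades a larger total distance for better mean balance, which is exactly the explicit control over imbalance measures that the network-flow formulations lack.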
From a practical standpoint, this method eliminates the guesswork involved in current optimal matching methods, and offers a controlled and systematic way of improving covariate balance by focusing the matching efforts on certain measures of covariate imbalance and their corresponding weights or tolerances. A matched case--control study of acute kidney injury after surgery among Medicare patients illustrates these features in detail. A new R package called mipmatch implements the method. Journal: Journal of the American Statistical Association Pages: 1360-1371 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.703874 File-URL: http://hdl.handle.net/10.1080/01621459.2012.703874 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1360-1371 Template-Type: ReDIF-Article 1.0 Author-Name: Donatello Telesca Author-X-Name-First: Donatello Author-X-Name-Last: Telesca Author-Name: Peter Müller Author-X-Name-First: Peter Author-X-Name-Last: Müller Author-Name: Steven M. Kornblau Author-X-Name-First: Steven M. Author-X-Name-Last: Kornblau Author-Name: Marc A. Suchard Author-X-Name-First: Marc A. Author-X-Name-Last: Suchard Author-Name: Yuan Ji Author-X-Name-First: Yuan Author-X-Name-Last: Ji Title: Modeling Protein Expression and Protein Signaling Pathways Abstract: High-throughput functional proteomic technologies provide a way to quantify the expression of proteins of interest. Statistical inference centers on identifying the activation state of proteins and their patterns of molecular interaction formalized as dependence structure. Inference on dependence structure is particularly important when proteins are selected because they are part of a common molecular pathway. In that case, inference on dependence structure reveals properties of the underlying pathway. We propose a probability model that represents molecular interactions at the level of hidden binary latent variables that can be interpreted as indicators for active versus inactive states of the proteins. The proposed approach exploits available expert knowledge about the target pathway to define an informative prior on the hidden conditional dependence structure. An important feature of this prior is that it provides an instrument to explicitly anchor the model space to a set of interactions of interest, favoring a local search approach to model determination. We apply our model to reverse-phase protein array data from a study on acute myeloid leukemia. Our inference identifies relevant subpathways in relation to the unfolding of the biological process under study. Journal: Journal of the American Statistical Association Pages: 1372-1384 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.706121 File-URL: http://hdl.handle.net/10.1080/01621459.2012.706121 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1372-1384 Template-Type: ReDIF-Article 1.0 Author-Name: Daniel Manrique-Vallier Author-X-Name-First: Daniel Author-X-Name-Last: Manrique-Vallier Author-Name: Jerome P. Reiter Author-X-Name-First: Jerome P. Author-X-Name-Last: Reiter Title: Estimating Identification Disclosure Risk Using Mixed Membership Models Abstract: Statistical agencies and other organizations that disseminate data are obligated to protect data subjects’ confidentiality. 
For example, ill-intentioned individuals might link data subjects to records in other databases by matching on common characteristics (keys). Successful links are particularly problematic for data subjects with combinations of keys that are unique in the population. Hence, as part of their assessments of disclosure risks, many data stewards estimate the probabilities that sample uniques on sets of discrete keys are also population uniques on those keys. This is typically done using log-linear modeling on the keys. However, log-linear models can yield biased estimates of cell probabilities for sparse contingency tables with many zero counts, which often occurs in databases with many keys. This bias can result in unreliable estimates of probabilities of uniqueness and, hence, misrepresentations of disclosure risks. We propose an alternative to log-linear models for datasets with sparse keys based on a Bayesian version of grade of membership (GoM) models. We present a Bayesian GoM model for multinomial variables and offer a Markov chain Monte Carlo algorithm for fitting the model. We evaluate the approach by treating data from a recent U.S. Census Bureau public use microdata sample as a population, taking simple random samples from that population, and benchmarking estimated probabilities of uniqueness against population values. Compared to log-linear models, GoM models provide more accurate estimates of the total number of uniques in the samples. Additionally, they offer record-level predictions of uniqueness that dominate those based on log-linear models. This article has online supplementary materials. Journal: Journal of the American Statistical Association Pages: 1385-1394 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.710508 File-URL: http://hdl.handle.net/10.1080/01621459.2012.710508 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1385-1394 Template-Type: ReDIF-Article 1.0 Author-Name: Michelle R. Danaher Author-X-Name-First: Michelle R. Author-X-Name-Last: Danaher Author-Name: Anindya Roy Author-X-Name-First: Anindya Author-X-Name-Last: Roy Author-Name: Zhen Chen Author-X-Name-First: Zhen Author-X-Name-Last: Chen Author-Name: Sunni L. Mumford Author-X-Name-First: Sunni L. Author-X-Name-Last: Mumford Author-Name: Enrique F. Schisterman Author-X-Name-First: Enrique F. Author-X-Name-Last: Schisterman Title: Minkowski--Weyl Priors for Models With Parameter Constraints: An Analysis of the BioCycle Study Abstract: We propose a general framework for performing full Bayesian analysis under linear inequality parameter constraints. The proposal is motivated by the BioCycle Study, a large cohort study of hormone levels of healthy women where certain well-established linear inequality constraints on the log-hormone levels should be accounted for in the statistical inferential procedure. Based on the Minkowski--Weyl decomposition of polyhedral regions, we propose a class of priors that are fully supported on the parameter space with linear inequality constraints, and we fit a Bayesian linear mixed model to the BioCycle data using such a prior. We observe positive associations between estrogen and progesterone levels and F2-isoprostanes, a marker for oxidative stress. These findings are of particular interest to reproductive epidemiologists. 
Journal: Journal of the American Statistical Association Pages: 1395-1409 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.712414 File-URL: http://hdl.handle.net/10.1080/01621459.2012.712414 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1395-1409 Template-Type: ReDIF-Article 1.0 Author-Name: Vanja Dukic Author-X-Name-First: Vanja Author-X-Name-Last: Dukic Author-Name: Hedibert F. Lopes Author-X-Name-First: Hedibert F. Author-X-Name-Last: Lopes Author-Name: Nicholas G. Polson Author-X-Name-First: Nicholas G. Author-X-Name-Last: Polson Title: Tracking Epidemics With Google Flu Trends Data and a State-Space SEIR Model Abstract: In this article, we use Google Flu Trends data together with a sequential surveillance model based on state-space methodology to track the evolution of an epidemic process over time. We embed a classical mathematical epidemiology model [a susceptible-exposed-infected-recovered (SEIR) model] within the state-space framework, thereby extending the SEIR dynamics to allow changes through time. The implementation of this model is based on a particle filtering algorithm, which learns about the epidemic process sequentially through time and provides updated estimated odds of a pandemic with each new surveillance data point. We show how our approach, in combination with sequential Bayes factors, can serve as an online diagnostic tool for influenza pandemics. We take a close look at the Google Flu Trends data describing the spread of flu in the United States during 2003--2009 and in nine separate U.S. states chosen to represent a wide range of health care and emergency system strengths and weaknesses. This article has online supplementary materials. Journal: Journal of the American Statistical Association Pages: 1410-1426 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.713876 File-URL: http://hdl.handle.net/10.1080/01621459.2012.713876 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1410-1426 Template-Type: ReDIF-Article 1.0 Author-Name: Donatello Telesca Author-X-Name-First: Donatello Author-X-Name-Last: Telesca Author-Name: Elena A. Erosheva Author-X-Name-First: Elena A. Author-X-Name-Last: Erosheva Author-Name: Derek A. Kreager Author-X-Name-First: Derek A. Author-X-Name-Last: Kreager Author-Name: Ross L. Matsueda Author-X-Name-First: Ross L. Author-X-Name-Last: Matsueda Title: Modeling Criminal Careers as Departures From a Unimodal Population Age--Crime Curve: The Case of Marijuana Use Abstract: A major aim of longitudinal analyses of life-course data is to describe the within- and between-individual variability in a behavioral outcome, such as crime. Statistical analyses of such data typically draw on mixture and mixed-effects growth models. In this work, we present a functional analytic point of view and develop an alternative method that models individual crime trajectories as departures from a population age--crime curve. Drawing on empirical and theoretical claims in criminology, we assume a unimodal population age--crime curve and allow individual expected crime trajectories to differ by their levels of offending and patterns of temporal misalignment. We extend Bayesian hierarchical curve registration methods to accommodate count data and to incorporate influence of baseline covariates on individual behavioral trajectories.
Analyzing self-reported counts of yearly marijuana use from the Denver Youth Survey, we examine the influence of race and gender categories on differences in levels and timing of marijuana smoking. We find that our approach offers a flexible model for longitudinal crime trajectories and allows for a rich array of inferences of interest to criminologists and drug abuse researchers. This article has supplementary materials online. Journal: Journal of the American Statistical Association Pages: 1427-1440 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.716328 File-URL: http://hdl.handle.net/10.1080/01621459.2012.716328 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1427-1440 Template-Type: ReDIF-Article 1.0 Author-Name: Summer S. Han Author-X-Name-First: Summer S. Author-X-Name-Last: Han Author-Name: Philip S. Rosenberg Author-X-Name-First: Philip S. Author-X-Name-Last: Rosenberg Author-Name: Nilanjan Chatterjee Author-X-Name-First: Nilanjan Author-X-Name-Last: Chatterjee Title: Testing for Gene--Environment and Gene--Gene Interactions Under Monotonicity Constraints Abstract: Recent genome-wide association studies (GWASs) designed to detect the main effects of genetic markers have had considerable success with many findings validated by replication studies. However, relatively few findings of gene--gene or gene--environment interactions have been successfully reproduced. Besides the main issues associated with insufficient sample size in current studies, a complication is that interactions that rank high based on p-values often correspond to extreme forms of joint effects that are biologically less plausible. To reduce false positives and to increase power, we develop various gene--environment/gene--gene tests based on biologically more plausible constraints using bivariate isotonic regressions for case--control data. We extend our method to exploit gene--environment or gene--gene independence information, integrating the approach proposed by Chatterjee and Carroll. We propose appropriate nonparametric and parametric permutation procedures for evaluating the significance of the tests. Simulations show that our method gains power over traditional unconstrained methods by reducing the sizes of alternative parameter spaces. We apply our method to several real-data examples, including an analysis of bladder cancer data to detect interactions between the NAT2 gene and smoking. We also show that the proposed method is computationally feasible for large-scale problems by applying it to the National Cancer Institute (NCI) lung cancer GWAS data. Journal: Journal of the American Statistical Association Pages: 1441-1452 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.726892 File-URL: http://hdl.handle.net/10.1080/01621459.2012.726892 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1441-1452 Template-Type: ReDIF-Article 1.0 Author-Name: Huixia Judy Wang Author-X-Name-First: Huixia Judy Author-X-Name-Last: Wang Author-Name: Deyuan Li Author-X-Name-First: Deyuan Author-X-Name-Last: Li Author-Name: Xuming He Author-X-Name-First: Xuming Author-X-Name-Last: He Title: Estimation of High Conditional Quantiles for Heavy-Tailed Distributions Abstract: Estimation of conditional quantiles at very high or low tails is of interest in numerous applications. 
Quantile regression provides a convenient and natural way of quantifying the impact of covariates at different quantiles of a response distribution. However, high tails are often associated with data sparsity, so quantile regression estimation can suffer from high variability at the tails, especially for heavy-tailed distributions. In this article, we develop new estimation methods for high conditional quantiles by first estimating the intermediate conditional quantiles in a conventional quantile regression framework and then extrapolating these estimates to the high tails based on reasonable assumptions on tail behaviors. We establish the asymptotic properties of the proposed estimators and demonstrate through simulation studies that the proposed methods enjoy higher accuracy than the conventional quantile regression estimates. In a real application involving statistical downscaling of daily precipitation in the Chicago area, the proposed methods provide more stable results quantifying the chance of heavy precipitation in the area. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1453-1464 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.716382 File-URL: http://hdl.handle.net/10.1080/01621459.2012.716382 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1453-1464 Template-Type: ReDIF-Article 1.0 Author-Name: Xianchao Xie Author-X-Name-First: Xianchao Author-X-Name-Last: Xie Author-Name: S. C. Kou Author-X-Name-First: S. C. Author-X-Name-Last: Kou Author-Name: Lawrence D. Brown Author-X-Name-First: Lawrence D. Author-X-Name-Last: Brown Title: SURE Estimates for a Heteroscedastic Hierarchical Model Abstract: Hierarchical models are extensively studied and widely used in statistics and many other scientific areas. They provide an effective tool for combining information from similar resources and achieving partial pooling of inference. Since the seminal work by James and Stein (1961) and Stein (1962), shrinkage estimation has become a major focus for hierarchical models. For the homoscedastic normal model, it is well known that shrinkage estimators, especially the James-Stein estimator, have good risk properties. The heteroscedastic model, though more appropriate for practical applications, is less well studied, and it is unclear what types of shrinkage estimators are superior in terms of risk. We propose in this article a class of shrinkage estimators based on Stein’s unbiased estimate of risk (SURE). We study asymptotic properties of various common estimators as the number of means to be estimated grows (p → ∞). We establish the asymptotic optimality property for the SURE estimators. We then extend our construction to create a class of semiparametric shrinkage estimators and establish corresponding asymptotic optimality results. We emphasize that though the form of our SURE estimators is partially obtained through a normal model at the sampling level, their optimality properties do not heavily depend on such distributional assumptions. We apply the methods to two real datasets and obtain encouraging results.
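As a toy illustration of SURE-based tuning in the heteroscedastic normal-means problem described above (a simplified sketch, not the authors' semiparametric construction), the following code computes Stein's unbiased risk estimate for a family of linear shrinkage estimators and picks the shrinkage parameter by minimizing it; the grid and the shrink-toward-zero form are assumptions of the example.

# SURE-tuned linear shrinkage for X_i ~ N(theta_i, A_i), A_i known
import numpy as np

rng = np.random.default_rng(2)
p = 1000
theta = rng.normal(0, 1, p)              # unknown means
A = rng.uniform(0.1, 2.0, p)             # known, unequal variances
X = rng.normal(theta, np.sqrt(A))        # one observation per mean

def sure(lam, X, A, mu=0.0):
    """Stein's unbiased risk estimate for the linear shrinkage estimator
    theta_hat_i = c_i X_i + (1 - c_i) mu with c_i = lam / (lam + A_i):
    SURE = sum_i (theta_hat_i - X_i)^2 + 2 A_i c_i - A_i."""
    c = lam / (lam + A)
    theta_hat = c * X + (1 - c) * mu
    return np.sum((theta_hat - X) ** 2 + 2 * A * c - A)

# Choose lambda by minimizing SURE over a grid (an assumed grid)
grid = np.linspace(1e-3, 10, 400)
lam = grid[np.argmin([sure(l, X, A) for l in grid])]
theta_hat = (lam / (lam + A)) * X

print(f"lambda = {lam:.2f}")
print(f"loss shrunk: {np.mean((theta_hat - theta) ** 2):.3f}")
print(f"loss raw   : {np.mean((X - theta) ** 2):.3f}")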
Journal: Journal of the American Statistical Association Pages: 1465-1479 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.728154 File-URL: http://hdl.handle.net/10.1080/01621459.2012.728154 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1465-1479 Template-Type: ReDIF-Article 1.0 Author-Name: Lingzhou Xue Author-X-Name-First: Lingzhou Author-X-Name-Last: Xue Author-Name: Shiqian Ma Author-X-Name-First: Shiqian Author-X-Name-Last: Ma Author-Name: Hui Zou Author-X-Name-First: Hui Author-X-Name-Last: Zou Title: Positive-Definite ℓ1-Penalized Estimation of Large Covariance Matrices Abstract: The thresholding covariance estimator has nice asymptotic properties for estimating sparse large covariance matrices, but it often has negative eigenvalues when used in real data analysis. To fix this drawback of thresholding estimation, we develop a positive-definite ℓ1-penalized covariance estimator for estimating sparse large covariance matrices. We derive an efficient alternating direction method to solve the challenging optimization problem and establish its convergence properties. Under weak regularity conditions, nonasymptotic statistical theory is also established for the proposed estimator. The competitive finite-sample performance of our proposal is demonstrated by both simulation and real applications. Journal: Journal of the American Statistical Association Pages: 1480-1491 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.725386 File-URL: http://hdl.handle.net/10.1080/01621459.2012.725386 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1480-1491 Template-Type: ReDIF-Article 1.0 Author-Name: Layla Parast Author-X-Name-First: Layla Author-X-Name-Last: Parast Author-Name: Su-Chun Cheng Author-X-Name-First: Su-Chun Author-X-Name-Last: Cheng Author-Name: Tianxi Cai Author-X-Name-First: Tianxi Author-X-Name-Last: Cai Title: Landmark Prediction of Long-Term Survival Incorporating Short-Term Event Time Information Abstract: In recent years, a wide range of markers have become available as potential tools to predict risk or progression of disease. In addition to such biological and genetic markers, short-term outcome information may be useful in predicting long-term disease outcomes. When such information is available, it would be desirable to combine this information with predictive markers to improve the prediction of long-term survival. Most existing methods for incorporating censored short-term event information in predicting long-term survival focus on modeling the disease process and are derived under restrictive parametric models in a multistate survival setting. When such model assumptions fail to hold, the resulting prediction of long-term outcomes may be invalid or inaccurate. When there is only a single discrete baseline covariate, a fully nonparametric estimation procedure to incorporate short-term event time information has been previously proposed. However, such an approach is not feasible for settings with one or more continuous covariates due to the curse of dimensionality. In this article, we propose to incorporate short-term event time information along with multiple covariates collected up to a landmark point via a flexible varying-coefficient model.
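Returning to the positive-definite ℓ1-penalized covariance estimator described above: one plausible alternating direction (ADMM) splitting, sketched below under assumed tuning values (lam, eps, rho), alternates an eigenvalue-clipping projection that enforces positive definiteness with entrywise soft-thresholding that enforces sparsity. This is an illustrative splitting, not necessarily the authors' exact algorithm.

# min over Sigma >= eps*I of 0.5*||Sigma - S||_F^2 + lam*||Sigma||_1 (off-diagonal)
import numpy as np

def soft(M, t):
    """Entrywise soft-thresholding, leaving the diagonal unpenalized."""
    out = np.sign(M) * np.maximum(np.abs(M) - t, 0.0)
    np.fill_diagonal(out, np.diag(M))
    return out

def pd_sparse_cov(S, lam=0.1, eps=1e-4, rho=1.0, iters=200):
    Theta = S.copy()
    U = np.zeros_like(S)
    for _ in range(iters):
        # Sigma-step: Frobenius projection of (S + rho*(Theta - U))/(1 + rho)
        # onto {Sigma : eigenvalues >= eps}, via eigenvalue clipping
        B = (S + rho * (Theta - U)) / (1.0 + rho)
        vals, vecs = np.linalg.eigh((B + B.T) / 2)
        Sigma = (vecs * np.maximum(vals, eps)) @ vecs.T
        # Theta-step: soft-thresholding enforces sparsity
        Theta = soft(Sigma + U, lam / rho)
        U += Sigma - Theta                 # dual update; Sigma = Theta at convergence
    return Theta

# Toy usage: sparse covariance estimated with eigenvalues bounded below
rng = np.random.default_rng(3)
C = np.eye(10); C[0, 1] = C[1, 0] = 0.5
X = rng.multivariate_normal(np.zeros(10), C, size=200)
Sig = pd_sparse_cov(np.cov(X, rowvar=False), lam=0.15)
print("min eigenvalue of estimate:", np.linalg.eigvalsh(Sig).min())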
To evaluate and compare the prediction performance of the resulting landmark prediction rule, we use robust nonparametric procedures that do not require the correct specification of the proposed varying-coefficient model. Simulation studies suggest that the proposed procedures perform well in finite samples. We illustrate them here using a dataset of postdialysis patients with end-stage renal disease. Journal: Journal of the American Statistical Association Pages: 1492-1501 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.721281 File-URL: http://hdl.handle.net/10.1080/01621459.2012.721281 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1492-1501 Template-Type: ReDIF-Article 1.0 Author-Name: P. L. Davies Author-X-Name-First: P. L. Author-X-Name-Last: Davies Title: Interactions in the Analysis of Variance Abstract: The standard model for the analysis of variance is over-parameterized. The resulting identifiability problem is typically solved by placing linear constraints on the parameters. In the case of the interactions, these require that the marginal sums be zero. Although seemingly neutral, these conditions have unintended consequences: the interactions are of necessity connected whether or not this is justified, the minimum number of nonzero interactions is four, and, in particular, it is not possible to have a single interaction in one cell. There is no reason why nature should conform to these constraints. The approach taken in this article is one of sparsity: the linear factor effects are chosen so as to minimize the number of nonzero interactions subject to consistency with the data. The resulting interactions are attached to individual cells, making their interpretation easier irrespective of whether they are isolated or form clusters. In general, the calculation of a sparse solution is a difficult combinatorial problem, but the special nature of the analysis of variance simplifies matters considerably. In many cases, the sparse L0 solution coincides with the L1 solution obtained by minimizing the sum of the absolute residuals, which can be calculated quickly. The identity of the two solutions can be checked either algorithmically or by applying known sufficient conditions for equality. Journal: Journal of the American Statistical Association Pages: 1502-1509 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.726895 File-URL: http://hdl.handle.net/10.1080/01621459.2012.726895 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1502-1509 Template-Type: ReDIF-Article 1.0 Author-Name: Justin S. Dyer Author-X-Name-First: Justin S. Author-X-Name-Last: Dyer Author-Name: Art B. Owen Author-X-Name-First: Art B. Author-X-Name-Last: Owen Title: Correct Ordering in the Zipf--Poisson Ensemble Abstract: Rankings based on counts are often presented to identify popular items, such as baby names, English words, or Web sites. This article shows that, in some examples, the number of correctly identified items can be very small. We introduce a standard error versus rank plot to diagnose possible misrankings. Then to explain the slowly growing number of correct ranks, we model the entire set of count data via a Zipf--Poisson ensemble with independent Xi ∼ Poi(Ni^(−α)) for α > 1 and N > 0 and integers i ⩾ 1.
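The Zipf--Poisson ensemble just defined is easy to simulate, which makes the correct-ordering phenomenon discussed next tangible. A minimal sketch follows (the parameter values alpha, N, and n_items are assumptions of the example): it draws the counts and measures the longest initial run that is correctly ordered, comparing it with the (AN/log N)^(1/(α + 2)) cutoff quoted below.

# Simulation of the Zipf--Poisson ensemble, X_i ~ Poi(N * i^(-alpha))
import numpy as np

rng = np.random.default_rng(4)
alpha, N, n_items = 1.5, 1e6, 100_000

i = np.arange(1, n_items + 1)
X = rng.poisson(N * i ** (-alpha))

# Longest initial run with strictly decreasing counts, i.e., items
# 1..n' ranked correctly relative to each other
dec = X[:-1] > X[1:]
n_prime = int(np.argmin(dec)) + 1 if not dec.all() else n_items

A = alpha ** 2 * (alpha + 2) / 4
cutoff = (A * N / np.log(N)) ** (1 / (alpha + 2))
print(f"observed correct-order run: {n_prime}, theoretical cutoff ~ {cutoff:.1f}")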
We show that as N → ∞, the first n′(N) random variables have their proper order relative to each other, with probability tending to 1, for n′ up to (AN/log N)^(1/(α + 2)) with A = α^2(α + 2)/4. We also show that the rate N^(1/(α + 2)) cannot be achieved. The ordering of the first n′(N) entities does not preclude Xm > Xn′ for some interloping m > n′. However, we show that the first n″ random variables are correctly ordered exclusive of any interlopers, with probability tending to 1 if n″ ⩽ (BN/log N)^(1/(α + 2)) for any B > A. We also show how to compute the cutoff for alternative models such as a Zipf--Mandelbrot--Poisson ensemble. Journal: Journal of the American Statistical Association Pages: 1510-1517 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.734177 File-URL: http://hdl.handle.net/10.1080/01621459.2012.734177 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1510-1517 Template-Type: ReDIF-Article 1.0 Author-Name: Lisha Chen Author-X-Name-First: Lisha Author-X-Name-Last: Chen Author-Name: Jianhua Z. Huang Author-X-Name-First: Jianhua Z. Author-X-Name-Last: Huang Title: Sparse Reduced-Rank Regression for Simultaneous Dimension Reduction and Variable Selection Abstract: Reduced-rank regression is an effective method for predicting multiple response variables from the same set of predictor variables. It reduces the number of model parameters and takes advantage of interrelations between the response variables and hence improves predictive accuracy. We propose to select relevant variables for reduced-rank regression by using a sparsity-inducing penalty. We apply a group-lasso type penalty that treats each row of the matrix of the regression coefficients as a group and show that this penalty satisfies certain desirable invariance properties. We develop two numerical algorithms to solve the penalized regression problem and establish the asymptotic consistency of the proposed method. In particular, the manifold structure of the reduced-rank regression coefficient matrix is considered and studied in our theoretical analysis. In our simulation study and real data analysis, the new method is compared with several existing variable selection methods for multivariate regression and exhibits competitive performance in prediction and variable selection. Journal: Journal of the American Statistical Association Pages: 1533-1545 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.734178 File-URL: http://hdl.handle.net/10.1080/01621459.2012.734178 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1533-1545 Template-Type: ReDIF-Article 1.0 Author-Name: S. C. Kou Author-X-Name-First: S. C. Author-X-Name-Last: Kou Author-Name: Benjamin P. Olding Author-X-Name-First: Benjamin P. Author-X-Name-Last: Olding Author-Name: Martin Lysy Author-X-Name-First: Martin Author-X-Name-Last: Lysy Author-Name: Jun S. Liu Author-X-Name-First: Jun S. Author-X-Name-Last: Liu Title: A Multiresolution Method for Parameter Estimation of Diffusion Processes Abstract: Diffusion process models are widely used in science, engineering, and finance. Most diffusion processes are described by stochastic differential equations in continuous time. In practice, however, data are typically observed only at discrete time points.
Except for a few very special cases, no analytic form exists for the likelihood of such discretely observed data. For this reason, parametric inference is often achieved by using discrete-time approximations, with accuracy controlled through the introduction of missing data. We present a new multiresolution Bayesian framework to address the inference difficulty. The methodology relies on the use of multiple approximations and extrapolation and is significantly faster and more accurate than known strategies based on Gibbs sampling. We apply the multiresolution approach to three data-driven inference problems, one of which features a multivariate diffusion model with an entirely unobserved component. Journal: Journal of the American Statistical Association Pages: 1558-1574 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.720899 File-URL: http://hdl.handle.net/10.1080/01621459.2012.720899 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1558-1574 Template-Type: ReDIF-Article 1.0 Author-Name: Ori Rosen Author-X-Name-First: Ori Author-X-Name-Last: Rosen Author-Name: Sally Wood Author-X-Name-First: Sally Author-X-Name-Last: Wood Author-Name: David S. Stoffer Author-X-Name-First: David S. Author-X-Name-Last: Stoffer Title: AdaptSPEC: Adaptive Spectral Estimation for Nonstationary Time Series Abstract: We propose a method for analyzing possibly nonstationary time series by adaptively dividing the time series into an unknown but finite number of segments and estimating the corresponding local spectra by smoothing splines. The model is formulated in a Bayesian framework, and the estimation relies on reversible jump Markov chain Monte Carlo (RJMCMC) methods. For a given segmentation of the time series, the likelihood function is approximated via a product of local Whittle likelihoods. Thus, no parametric assumption is made about the process underlying the time series. The number and lengths of the segments are assumed unknown and may change from one MCMC iteration to another. The frequentist properties of the method are investigated by simulation, and applications to electroencephalogram and the El Niño Southern Oscillation phenomenon are described in detail. Journal: Journal of the American Statistical Association Pages: 1575-1589 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.716340 File-URL: http://hdl.handle.net/10.1080/01621459.2012.716340 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1575-1589 Template-Type: ReDIF-Article 1.0 Author-Name: Kehui Chen Author-X-Name-First: Kehui Author-X-Name-Last: Chen Author-Name: Hans-Georg Müller Author-X-Name-First: Hans-Georg Author-X-Name-Last: Müller Title: Modeling Repeated Functional Observations Abstract: We introduce a new methodological framework for repeatedly observed and thus dependent functional data, aiming at situations where curves are recorded repeatedly for each subject in a sample. Our methodology covers the case where the recordings of the curves are scheduled on a regular and dense grid and also situations more typical for longitudinal studies, where the timing of recordings is often sparse and random. 
The proposed models lead to an interpretable and straightforward decomposition of the inherent variation in repeatedly observed functional data and are implemented through a simple two-step functional principal component analysis. We provide consistency results and asymptotic convergence rates for the estimated model components. We compare the proposed model with an alternative approach via a two-dimensional Karhunen-Loève expansion and illustrate it through the analysis of longitudinal mortality data from period lifetables that are repeatedly observed for a sample of countries over many years, and also through simulation studies. This article has online supplementary materials. Journal: Journal of the American Statistical Association Pages: 1599-1609 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.734196 File-URL: http://hdl.handle.net/10.1080/01621459.2012.734196 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1599-1609 Template-Type: ReDIF-Article 1.0 Author-Name: Howard D. Bondell Author-X-Name-First: Howard D. Author-X-Name-Last: Bondell Author-Name: Brian J. Reich Author-X-Name-First: Brian J. Author-X-Name-Last: Reich Title: Consistent High-Dimensional Bayesian Variable Selection via Penalized Credible Regions Abstract: For high-dimensional data, particularly when the number of predictors greatly exceeds the sample size, selection of relevant predictors for regression is a challenging problem. Methods such as sure screening, forward selection, or penalized regressions are commonly used. Bayesian variable selection methods place prior distributions on the parameters along with a prior over model space, or equivalently, a mixture prior on the parameters having mass at zero. Since exhaustive enumeration is not feasible, posterior model probabilities are often obtained via long Markov chain Monte Carlo (MCMC) runs. The chosen model can depend heavily on various choices for priors and also posterior thresholds. Alternatively, we propose a conjugate prior only on the full model parameters and use sparse solutions within posterior credible regions to perform selection. These posterior credible regions often have closed-form representations, and it is shown that these sparse solutions can be computed via existing algorithms. The approach is shown to outperform common methods in the high-dimensional setting, particularly under correlation. By searching for a sparse solution within a joint credible region, we establish consistent model selection. Furthermore, it is shown that, under certain conditions, the use of marginal credible intervals can give consistent selection up to the case where the dimension grows exponentially in the sample size. The proposed approach successfully accomplishes variable selection in the high-dimensional setting, while avoiding pitfalls that plague typical Bayesian variable selection methods. Journal: Journal of the American Statistical Association Pages: 1610-1624 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.716344 File-URL: http://hdl.handle.net/10.1080/01621459.2012.716344 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1610-1624 Template-Type: ReDIF-Article 1.0 Author-Name: Haipeng Xing Author-X-Name-First: Haipeng Author-X-Name-Last: Xing Author-Name: Zhiliang Ying Author-X-Name-First: Zhiliang Author-X-Name-Last: Ying Title: A Semiparametric Change-Point Regression Model for Longitudinal Observations Abstract: Many longitudinal studies involve relating an outcome process to a set of possibly time-varying covariates, giving rise to the usual regression models for longitudinal data. When the purpose of the study is to investigate the covariate effects when the experimental environment undergoes abrupt changes or to locate the periods with different levels of covariate effects, a simple and easy-to-interpret approach is to introduce change-points in regression coefficients. In this connection, we propose a semiparametric change-point regression model, in which the error process (stochastic component) is nonparametric and the baseline mean function (functional part) is completely unspecified, the observation times are allowed to be subject-specific, and the number, locations, and magnitudes of change-points are unknown and need to be estimated. We further develop an estimation procedure that combines recent advances in counting-process-based semiparametric analysis with multiple change-point inference, and we discuss its large sample properties, including consistency and asymptotic normality, under suitable regularity conditions. Simulation results show that the proposed methods work well under a variety of scenarios. An application to a real dataset is also given. Journal: Journal of the American Statistical Association Pages: 1625-1637 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.712425 File-URL: http://hdl.handle.net/10.1080/01621459.2012.712425 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1625-1637 Template-Type: ReDIF-Article 1.0 Author-Name: Paul S. Clarke Author-X-Name-First: Paul S. Author-X-Name-Last: Clarke Author-Name: Frank Windmeijer Author-X-Name-First: Frank Author-X-Name-Last: Windmeijer Title: Instrumental Variable Estimators for Binary Outcomes Abstract: Instrumental variables (IVs) can be used to construct estimators of exposure effects on the outcomes of studies affected by nonignorable selection of the exposure. Estimators that fail to adjust for the effects of nonignorable selection will be biased and inconsistent. Such situations commonly arise in observational studies, but are also a problem for randomized experiments affected by nonignorable noncompliance. In this article, we review IV estimators for studies in which the outcome is binary, and consider the links between different approaches developed in the statistics and econometrics literatures. The implicit assumptions made by each method are highlighted and compared within our framework. We illustrate our findings through the reanalysis of a randomized placebo-controlled trial, and highlight important directions for future work in this area. Journal: Journal of the American Statistical Association Pages: 1638-1652 Issue: 500 Volume: 107 Year: 2012 Month: 12 X-DOI: 10.1080/01621459.2012.734171 File-URL: http://hdl.handle.net/10.1080/01621459.2012.734171 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1638-1652 Template-Type: ReDIF-Article 1.0 Author-Name: Robert N.
Rodriguez Author-X-Name-First: Robert N. Author-X-Name-Last: Rodriguez Title: Building the Big Tent for Statistics Journal: Journal of the American Statistical Association Pages: 1-6 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2013.771010 File-URL: http://hdl.handle.net/10.1080/01621459.2013.771010 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:1-6 Template-Type: ReDIF-Article 1.0 Author-Name: Haeran Cho Author-X-Name-First: Haeran Author-X-Name-Last: Cho Author-Name: Yannig Goude Author-X-Name-First: Yannig Author-X-Name-Last: Goude Author-Name: Xavier Brossat Author-X-Name-First: Xavier Author-X-Name-Last: Brossat Author-Name: Qiwei Yao Author-X-Name-First: Qiwei Author-X-Name-Last: Yao Title: Modeling and Forecasting Daily Electricity Load Curves: A Hybrid Approach Abstract: We propose a hybrid approach for the modeling and the short-term forecasting of electricity loads. Two building blocks of our approach are (1) modeling the overall trend and seasonality by fitting a generalized additive model to the weekly averages of the load and (2) modeling the dependence structure across consecutive daily loads via curve linear regression. For the latter, a new methodology is proposed for linear regression with both curve response and curve regressors. The key idea behind the proposed methodology is dimension reduction based on a singular value decomposition in a Hilbert space, which reduces the curve regression problem to several ordinary (i.e., scalar) linear regression problems. We illustrate the hybrid method using French electricity loads between 1996 and 2009, on which we also compare our method with other available models including the Électricité de France operational model. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 7-21 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.722900 File-URL: http://hdl.handle.net/10.1080/01621459.2012.722900 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:7-21 Template-Type: ReDIF-Article 1.0 Author-Name: Ephraim M. Hanks Author-X-Name-First: Ephraim M. Author-X-Name-Last: Hanks Author-Name: Mevin B. Hooten Author-X-Name-First: Mevin B. Author-X-Name-Last: Hooten Title: Circuit Theory and Model-Based Inference for Landscape Connectivity Abstract: Circuit theory has seen extensive recent use in the field of ecology, where it is often applied to study functional connectivity. The landscape is typically represented by a network of nodes and resistors, with the resistance between nodes a function of landscape characteristics. The effective distance between two locations on a landscape is represented by the resistance distance between the nodes in the network. Circuit theory has been applied to many other scientific fields for exploratory analyses, but parametric models for circuits are not common in the scientific literature. To model circuits explicitly, we demonstrate a link between Gaussian Markov random fields and contemporary circuit theory using a covariance structure that induces the necessary resistance distance. This provides a parametric model for second-order observations from such a system. 
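As a hedged illustration of the construction just described: the effective resistance between every pair of nodes in a resistor network can be computed from the Moore-Penrose pseudoinverse of the graph Laplacian, which is also the covariance device that links circuits to Gaussian Markov random fields. The Python sketch below assumes only a symmetric conductance matrix W; it is an illustrative reconstruction, not the authors' implementation.

    import numpy as np

    def resistance_distance(W):
        """Effective resistance between all node pairs of a resistor network.
        W[i, j] is the conductance (1/resistance) of the edge between i and j."""
        L = np.diag(W.sum(axis=1)) - W   # graph Laplacian
        Lp = np.linalg.pinv(L)           # Moore-Penrose pseudoinverse
        d = np.diag(Lp)
        return d[:, None] + d[None, :] - 2 * Lp  # R_ij = Lp_ii + Lp_jj - 2 Lp_ij

    # Example: a 3-node chain with unit conductances
    W = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
    print(resistance_distance(W))  # entry (0, 2): two unit resistors in series give R = 2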
In the landscape ecology setting, the proposed model provides a simple framework for inference on the effects that landscape features have on functional connectivity. We illustrate the approach through a landscape genetics study linking gene flow in alpine chamois (Rupicapra rupicapra) to the underlying landscape. Journal: Journal of the American Statistical Association Pages: 22-33 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.724647 File-URL: http://hdl.handle.net/10.1080/01621459.2012.724647 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:22-33 Template-Type: ReDIF-Article 1.0 Author-Name: Roee Gutman Author-X-Name-First: Roee Author-X-Name-Last: Gutman Author-Name: Christopher C. Afendulis Author-X-Name-First: Christopher C. Author-X-Name-Last: Afendulis Author-Name: Alan M. Zaslavsky Author-X-Name-First: Alan M. Author-X-Name-Last: Zaslavsky Title: A Bayesian Procedure for File Linking to Analyze End-of-Life Medical Costs Abstract: End-of-life medical expenses are a significant proportion of all health care expenditures. These costs were studied using costs of services from Medicare claims and cause of death (CoD) from death certificates. In the absence of a unique identifier linking the two datasets, common variables identified unique matches for only 33% of deaths. The remaining cases formed cells with multiple cases (32% in cells with an equal number of cases from each file and 35% in cells with an unequal number). We sampled from the joint posterior distribution of model parameters and the permutations that link cases from the two files within each cell. The linking models included the regression of location of death on CoD and other parameters, and the regression of cost measures with a monotone missing data pattern on CoD and other demographic characteristics. Permutations were sampled by enumerating the exact distribution for small cells and by the Metropolis algorithm for large cells. Sparse matrix data structures enabled efficient calculations despite the large dataset (≈1.7 million cases). The procedure generates m datasets in which the matches between the two files are imputed. The m datasets can be analyzed independently and results can be combined using Rubin's multiple imputation rules. Our approach can be applied in other file-linking applications. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 34-47 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.726889 File-URL: http://hdl.handle.net/10.1080/01621459.2012.726889 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:34-47 Template-Type: ReDIF-Article 1.0 Author-Name: Man-Wai Ho Author-X-Name-First: Man-Wai Author-X-Name-Last: Ho Author-Name: Wanzhu Tu Author-X-Name-First: Wanzhu Author-X-Name-Last: Tu Author-Name: Pulak Ghosh Author-X-Name-First: Pulak Author-X-Name-Last: Ghosh Author-Name: Ram C. Tiwari Author-X-Name-First: Ram C. Author-X-Name-Last: Tiwari Title: A Nested Dirichlet Process Analysis of Cluster Randomized Trial Data With Application in Geriatric Care Assessment Abstract: In cluster randomized trials, patients seen by the same physician are randomized to the same treatment arm as a group.
Besides the natural clustering of patients due to cluster/group randomization, interactions between an individual patient and the attending physician within the group could just as well influence patient care outcomes. Despite the intuitive relevance of these interactions to treatment assessment, few studies have thus far examined their influence. Whether and to what extent these interactions affect assessment of the treatment effect remains unexplored. In fact, few statistical models provide ready accommodation for such interactions. In this research, we propose a general modeling framework based on the nested Dirichlet process (nDP) for assessing treatment effect in cluster randomized trials. The proposed methodology explicitly accounts for physician--patient interactions by assuming that the interactions follow unspecified group-specific distributions from an nDP. In addition to accounting for physician--patient interactions, the model greatly enhances the flexibility of traditional mixed-effects models by allowing for nonnormally distributed random effects, thus alleviating concerns about random-effects misspecification and sidestepping verification of distributional assumptions on random effects. At the same time, the model retains the mixed models' ability to make inferences on fixed effects. The proposed method is easily extendable to more complicated hierarchical clustering structures. We introduce the method in the context of a real cluster randomized trial. A comprehensive simulation study was conducted to assess the operating characteristics of the proposed nDP model. Journal: Journal of the American Statistical Association Pages: 48-68 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.734164 File-URL: http://hdl.handle.net/10.1080/01621459.2012.734164 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:48-68 Template-Type: ReDIF-Article 1.0 Author-Name: Riten Mitra Author-X-Name-First: Riten Author-X-Name-Last: Mitra Author-Name: Peter Müller Author-X-Name-First: Peter Author-X-Name-Last: Müller Author-Name: Shoudan Liang Author-X-Name-First: Shoudan Author-X-Name-Last: Liang Author-Name: Lu Yue Author-X-Name-First: Lu Author-X-Name-Last: Yue Author-Name: Yuan Ji Author-X-Name-First: Yuan Author-X-Name-Last: Ji Title: A Bayesian Graphical Model for ChIP-Seq Data on Histone Modifications Abstract: Histone modifications (HMs) are an important post-translational feature. Different types of HMs are believed to co-exist and co-regulate biological processes such as gene expression and, therefore, are intrinsically dependent on each other. We develop inference for this complex biological network of HMs based on a graphical model using ChIP-Seq data. A critical computational hurdle in the inference for the proposed graphical model is the evaluation of a normalization constant in an autologistic model that builds on the graphical model. We tackle the problem by Monte Carlo evaluation of ratios of normalization constants. We carry out a set of simulations to validate the proposed approach and to compare it with a standard approach using Bayesian networks. We report inference on HM dependence in a case study with ChIP-Seq data from a next generation sequencing experiment. An important feature of our approach is that we can report coherent probabilities and estimates related to any event or parameter of interest, including honest uncertainties.
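To make the normalization-constant hurdle concrete: for a small autologistic model, the ratio Z(theta1)/Z(theta0) equals an expectation of importance weights under the model at theta0 and can therefore be estimated by Monte Carlo. The toy sketch below uses exact enumeration of a 3-node graph in place of the samplers actually needed at realistic scale; all names and parameter values are hypothetical.

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    E = [(0, 1), (1, 2), (0, 2)]                  # edges of a 3-node graph

    def suff(x):
        """Sufficient statistics: main effects and pairwise products."""
        return np.concatenate([x, [x[i] * x[j] for i, j in E]])

    def unnorm(x, theta):
        return np.exp(theta @ suff(x))            # unnormalized probability

    states = [np.array(s) for s in itertools.product([0, 1], repeat=3)]
    theta0 = np.zeros(6)
    theta1 = rng.normal(size=6) * 0.5

    Z0 = sum(unnorm(x, theta0) for x in states)   # exact, feasible only for tiny graphs
    Z1 = sum(unnorm(x, theta1) for x in states)

    # Monte Carlo: Z1/Z0 = E_{x ~ p_theta0}[ exp((theta1 - theta0) @ suff(x)) ]
    probs = np.array([unnorm(x, theta0) / Z0 for x in states])
    idx = rng.choice(len(states), size=20000, p=probs)
    est = np.mean([np.exp((theta1 - theta0) @ suff(states[i])) for i in idx])
    print(Z1 / Z0, est)                           # exact ratio vs. Monte Carlo estimate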
Posterior inference is obtained from a joint probability model on latent indicators for the recorded HMs. We illustrate this in the motivating case study. An R package including an implementation of posterior simulation in C is available from Riten Mitra upon request. Journal: Journal of the American Statistical Association Pages: 69-80 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.746058 File-URL: http://hdl.handle.net/10.1080/01621459.2012.746058 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:69-80 Template-Type: ReDIF-Article 1.0 Author-Name: Michael W. Robbins Author-X-Name-First: Michael W. Author-X-Name-Last: Robbins Author-Name: Sujit K. Ghosh Author-X-Name-First: Sujit K. Author-X-Name-Last: Ghosh Author-Name: Joshua D. Habiger Author-X-Name-First: Joshua D. Author-X-Name-Last: Habiger Title: Imputation in High-Dimensional Economic Data as Applied to the Agricultural Resource Management Survey Abstract: In this article, we consider imputation in the USDA's Agricultural Resource Management Survey (ARMS) data, which is a complex, high-dimensional economic dataset. We develop a robust joint model for ARMS data, which requires that variables are transformed using a suitable class of marginal densities (e.g., skew normal family). We assume that the transformed variables may be linked through a Gaussian copula, which enables construction of the joint model via a sequence of conditional linear models. We also discuss the criteria used to select the predictors for each conditional model. For the purpose of developing an imputation method that is conducive to these model assumptions, we propose a regression-based technique that allows for flexibility in the selection of conditional models while providing a valid joint distribution. In this procedure, labeled as iterative sequential regression (ISR), parameter estimates and imputations are obtained using a Markov chain Monte Carlo sampling method. Finally, we apply the proposed method to the full ARMS data, and we present a thorough data analysis that serves to gauge the appropriateness of the resulting imputations. Our results demonstrate the effectiveness of the proposed algorithm and illustrate the specific deficiencies of existing methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 81-95 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.734158 File-URL: http://hdl.handle.net/10.1080/01621459.2012.734158 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:81-95 Template-Type: ReDIF-Article 1.0 Author-Name: Mark C. Wheldon Author-X-Name-First: Mark C. Author-X-Name-Last: Wheldon Author-Name: Adrian E. Raftery Author-X-Name-First: Adrian E. Author-X-Name-Last: Raftery Author-Name: Samuel J. Clark Author-X-Name-First: Samuel J. Author-X-Name-Last: Clark Author-Name: Patrick Gerland Author-X-Name-First: Patrick Author-X-Name-Last: Gerland Title: Reconstructing Past Populations With Uncertainty From Fragmentary Data Abstract: Current methods for reconstructing human populations of the past by age and sex are deterministic or do not formally account for measurement error. 
We propose a method for simultaneously estimating age-specific population counts, fertility rates, mortality rates, and net international migration flows from fragmentary data that incorporates measurement error. Inference is based on joint posterior probability distributions that yield fully probabilistic interval estimates. It is designed for the kind of data commonly collected in modern demographic surveys and censuses. Population dynamics over the period of reconstruction are modeled by embedding formal demographic accounting relationships in a Bayesian hierarchical model. Informative priors are specified for vital rates, migration rates, population counts at baseline, and their respective measurement error variances. We investigate calibration of central posterior marginal probability intervals by simulation and demonstrate the method by reconstructing the female population of Burkina Faso from 1960 to 2005. Supplementary materials for this article are available online and the method is implemented in the R package "popReconstruct." Journal: Journal of the American Statistical Association Pages: 96-110 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.737729 File-URL: http://hdl.handle.net/10.1080/01621459.2012.737729 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:96-110 Template-Type: ReDIF-Article 1.0 Author-Name: Duchwan Ryu Author-X-Name-First: Duchwan Author-X-Name-Last: Ryu Author-Name: Faming Liang Author-X-Name-First: Faming Author-X-Name-Last: Liang Author-Name: Bani K. Mallick Author-X-Name-First: Bani K. Author-X-Name-Last: Mallick Title: Sea Surface Temperature Modeling using Radial Basis Function Networks With a Dynamically Weighted Particle Filter Abstract: Sea surface temperature (SST) is an important factor in the Earth's climate system. A deep understanding of SST is essential for climate monitoring and prediction. In general, SST follows a nonlinear pattern in both time and location and can be modeled by a dynamic system that changes with time and location. In this article, we propose a radial basis function network-based dynamic model that is able to capture the nonlinearity of the data and propose to use the dynamically weighted particle filter to estimate the parameters of the dynamic model. We analyze the SST observed in the Caribbean Islands area after a hurricane using the proposed dynamic model. Compared with the traditional grid-based approach, which requires a supercomputer due to its high computational demand, our approach requires much less CPU time and makes real-time forecasting of SST feasible on a personal computer. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 111-123 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.734151 File-URL: http://hdl.handle.net/10.1080/01621459.2012.734151 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:111-123 Template-Type: ReDIF-Article 1.0 Author-Name: Drew A. Linzer Author-X-Name-First: Drew A. Author-X-Name-Last: Linzer Title: Dynamic Bayesian Forecasting of Presidential Elections in the States Abstract: I present a dynamic Bayesian forecasting model that enables early and accurate prediction of U.S. presidential election outcomes at the state level.
The method systematically combines information from historical forecasting models in real time with results from the large number of state-level opinion surveys that are released publicly during the campaign. The result is a set of forecasts that are initially as good as the historical model, and then gradually increase in accuracy as Election Day nears. I employ a hierarchical specification to overcome the limitation that not every state is polled on every day, allowing the model to borrow strength both across states and, through the use of random-walk priors, across time. The model also filters away day-to-day variation in the polls due to sampling error and national campaign effects, which enables daily tracking of voter preferences toward the presidential candidates at the state and national levels. Simulation techniques are used to estimate the candidates' probability of winning each state and, consequently, a majority of votes in the Electoral College. I apply the model to preelection polls from the 2008 presidential campaign and demonstrate that the victory of Barack Obama was never realistically in doubt. Journal: Journal of the American Statistical Association Pages: 124-134 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.737735 File-URL: http://hdl.handle.net/10.1080/01621459.2012.737735 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:124-134 Template-Type: ReDIF-Article 1.0 Author-Name: Jesse Y. Hsu Author-X-Name-First: Jesse Y. Author-X-Name-Last: Hsu Author-Name: Dylan S. Small Author-X-Name-First: Dylan S. Author-X-Name-Last: Small Author-Name: Paul R. Rosenbaum Author-X-Name-First: Paul R. Author-X-Name-Last: Rosenbaum Title: Effect Modification and Design Sensitivity in Observational Studies Abstract: In an observational study of treatment effects, subjects are not randomly assigned to treatment or control, so differing outcomes in treated and control groups may reflect a bias from nonrandom assignment rather than a treatment effect. After adjusting for measured pretreatment covariates, perhaps by matching, a sensitivity analysis determines the magnitude of bias from an unmeasured covariate that would need to be present to alter the conclusions of the naive analysis that presumes adjustments eliminated all bias. Other things being equal, larger effects tend to be less sensitive to bias than smaller effects. Effect modification is an interaction between a treatment and a pretreatment covariate controlled by matching, so that the treatment effect is larger at some values of the covariate than at others. In the presence of effect modification, it is possible that results are less sensitive to bias in subgroups experiencing larger effects. Two cases are considered: (i) an a priori grouping into a few categories based on covariates controlled by matching and (ii) a grouping discovered empirically in the data at hand. In case (i), subgroup-specific bounds on p-values are combined using the truncated product of p-values. In case (ii), information that is fixed under the null hypothesis of no treatment effect is used to partition matched pairs in the hope of identifying pairs with larger effects. The methods are evaluated using an asymptotic device, the design sensitivity, and using simulation. Sensitivity analysis for a test of the global null hypothesis of no effect is converted to sensitivity analyses for subgroup analyses using closed testing.
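The truncated product of p-values used in case (i) is simple to compute and to calibrate by simulation; the sketch below (the threshold tau, the example p-values, and the Monte Carlo size are illustrative assumptions) shows only the basic combination step, not the sensitivity-analysis bounds.

    import numpy as np

    def truncated_product(pvals, tau=0.05):
        """Zaykin-style truncated product: product of the p-values at or below tau."""
        p = np.asarray(pvals)
        return np.prod(np.where(p <= tau, p, 1.0))

    def combined_pvalue(pvals, tau=0.05, n_sim=100000, seed=0):
        """Calibrate the observed statistic against independent uniform nulls."""
        rng = np.random.default_rng(seed)
        w_obs = truncated_product(pvals, tau)
        u = rng.uniform(size=(n_sim, len(pvals)))
        w_null = np.prod(np.where(u <= tau, u, 1.0), axis=1)
        return np.mean(w_null <= w_obs)

    print(combined_pvalue([0.01, 0.20, 0.03]))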
A study of an intervention to control malaria in Africa is used to illustrate. Journal: Journal of the American Statistical Association Pages: 135-148 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.742018 File-URL: http://hdl.handle.net/10.1080/01621459.2012.742018 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:135-148 Template-Type: ReDIF-Article 1.0 Author-Name: Edoardo M. Airoldi Author-X-Name-First: Edoardo M. Author-X-Name-Last: Airoldi Author-Name: Alexander W. Blocker Author-X-Name-First: Alexander W. Author-X-Name-Last: Blocker Title: Estimating Latent Processes on a Network From Indirect Measurements Abstract: In a communication network, point-to-point traffic volumes over time are critical for designing protocols that route information efficiently and for maintaining security, whether at the scale of an Internet service provider or within a corporation. While technically feasible, the direct measurement of point-to-point traffic imposes a heavy burden on network performance and is typically not implemented. Instead, indirect aggregate traffic volumes are routinely collected. We consider the problem of estimating point-to-point traffic volumes, x_t, from aggregate traffic volumes, y_t, given information about the network routing protocol encoded in a matrix A. This estimation task can be reformulated as finding the solutions to a sequence of ill-posed linear inverse problems, y_t = A x_t, since the number of origin-destination routes of interest is higher than the number of aggregate measurements available. Here, we introduce a novel multilevel state-space model (SSM) of aggregate traffic volumes with realistic features. We implement a naïve strategy for estimating unobserved point-to-point traffic volumes from indirect measurements of aggregate traffic, based on particle filtering. We then develop a more efficient two-stage inference strategy that relies on model-based regularization: a simple model is used to calibrate regularization parameters that lead to efficient/scalable inference in the multilevel SSM. We apply our methods to corporate and academic networks, where we show that the proposed inference strategy outperforms existing approaches and scales to larger networks. We also design a simulation study to explore the factors that influence the performance. Our results suggest that model-based regularization may be an efficient strategy for inference in other complex multilevel models. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 149-164 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.756328 File-URL: http://hdl.handle.net/10.1080/01621459.2012.756328 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:149-164 Template-Type: ReDIF-Article 1.0 Author-Name: Asaf Weinstein Author-X-Name-First: Asaf Author-X-Name-Last: Weinstein Author-Name: William Fithian Author-X-Name-First: William Author-X-Name-Last: Fithian Author-Name: Yoav Benjamini Author-X-Name-First: Yoav Author-X-Name-Last: Benjamini Title: Selection Adjusted Confidence Intervals With More Power to Determine the Sign Abstract: In many current large-scale problems, confidence intervals (CIs) are constructed only for the parameters that are large, as indicated by their estimators, ignoring the smaller parameters.
Such selective inference poses a problem for the usual marginal CIs, which no longer offer the right level of coverage, not even on average over the selected parameters. We address this problem by developing three methods to construct short and valid CIs for the location parameter of a symmetric unimodal distribution, while conditioning on its estimator being larger than some constant threshold. In two of these methods, the CI is further required to offer early sign determination, that is, to avoid including parameters of both signs for relatively small values of the estimator. One of the two, the Conditional Quasi-Conventional CI, offers a good balance between length and sign determination while protecting from the effect of selection. The CI is not symmetric, extending more toward 0 than away from it, nor is it of constant shape. However, when the estimator is far away from the threshold, the proposed CI tends to the usual marginal one. In spite of its complexity, it is specified by closed form expressions, up to a small set of constants that are each the solution of a single variable equation. When multiple testing procedures are used to control the false discovery rate or other error rates, the resulting threshold for selecting may be data dependent. We show that conditioning the above CIs on the data-dependent threshold still controls the false coverage-statement rate (FCR) for many widely used testing procedures. For these reasons, the conditional CIs for the parameters selected this way are an attractive alternative to the available general FCR adjusted intervals. We demonstrate the use of the method in the analysis of some 14,000 correlations between hormone change and brain activity change in response to the subjects being exposed to stressful movie clips. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 165-176 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.737740 File-URL: http://hdl.handle.net/10.1080/01621459.2012.737740 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:165-176 Template-Type: ReDIF-Article 1.0 Author-Name: S. M. Schennach Author-X-Name-First: S. M. Author-X-Name-Last: Schennach Author-Name: Yingyao Hu Author-X-Name-First: Yingyao Author-X-Name-Last: Hu Title: Nonparametric Identification and Semiparametric Estimation of Classical Measurement Error Models Without Side Information Abstract: Virtually all methods aimed at correcting for covariate measurement error in regressions rely on some form of additional information (e.g., validation data, known error distributions, repeated measurements, or instruments). In contrast, we establish that the fully nonparametric classical errors-in-variables model is identifiable from data on the regressor and the dependent variable alone, unless the model takes a very specific parametric form. This parametric family includes (but is not limited to) the linear specification with normally distributed variables as a well-known special case. This result relies on standard primitive regularity conditions taking the form of smoothness constraints and assumptions of nonvanishing characteristic functions. Our approach can handle both monotone and nonmonotone specifications, provided the latter oscillate a finite number of times.
Given that the very specific unidentified parametric functional form is arguably the exception rather than the rule, this identification result should have wide applicability. It leads to a new perspective on handling measurement error in nonlinear and nonparametric models, opening the way to a novel and practical approach to correct for measurement error in datasets where it was previously considered impossible (due to the lack of additional information regarding the measurement error). We suggest an estimator based on non/semiparametric maximum likelihood, derive its asymptotic properties, and illustrate the effectiveness of the method with a simulation study and an application to the relationship between firm investment behavior and market value, the latter being notoriously mismeasured. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 177-186 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.751872 File-URL: http://hdl.handle.net/10.1080/01621459.2012.751872 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:177-186 Template-Type: ReDIF-Article 1.0 Author-Name: Garritt Page Author-X-Name-First: Garritt Author-X-Name-Last: Page Author-Name: Abhishek Bhattacharya Author-X-Name-First: Abhishek Author-X-Name-Last: Bhattacharya Author-Name: David Dunson Author-X-Name-First: David Author-X-Name-Last: Dunson Title: Classification via Bayesian Nonparametric Learning of Affine Subspaces Abstract: It has become common for datasets to contain large numbers of variables in studies conducted in areas such as genetics, machine vision, image analysis, and many others. When analyzing such data, parametric models are often too inflexible, while nonparametric procedures tend to be nonrobust because of insufficient data on these high-dimensional spaces. This is particularly true when interest lies in building efficient classifiers in the presence of many predictor variables. When dealing with these types of data, it is often the case that most of the variability tends to lie along a few directions, or more generally along a much smaller dimensional submanifold of the data space. In this article, we propose a class of models that flexibly learn about this submanifold while simultaneously performing dimension reduction in classification. This methodology allows the cell probabilities to vary nonparametrically based on a few coordinates expressed as linear combinations of the predictors. Also, as opposed to many black-box methods for dimensionality reduction, the proposed model is appealing in having clearly interpretable and identifiable parameters that provide insight into which predictors are important in determining accurate classification boundaries. Gibbs sampling methods are developed for posterior computation, and the methods are illustrated using simulated and real data applications. Journal: Journal of the American Statistical Association Pages: 187-201 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2013.763566 File-URL: http://hdl.handle.net/10.1080/01621459.2013.763566 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:187-201 Template-Type: ReDIF-Article 1.0 Author-Name: Arlene Naranjo Author-X-Name-First: Arlene Author-X-Name-Last: Naranjo Author-Name: A. Alexandre Trindade Author-X-Name-First: A.
Alexandre Author-X-Name-Last: Trindade Author-Name: George Casella Author-X-Name-First: George Author-X-Name-Last: Casella Title: Extending the State-Space Model to Accommodate Missing Values in Responses and Covariates Abstract: This article proposes an extended state-space model for accommodating multivariate panel data. The novel aspect of this contribution is an adjustment to the classical model for multiple subjects that allows missingness in the covariates in addition to the responses. Missing covariate data are handled by a second state-space model nested inside the first to represent unobserved exogenous information. Relevant Kalman filter equations are derived, and explicit expressions are provided for both the E- and M-steps of an expectation-maximization (EM) algorithm, to obtain maximum (Gaussian) likelihood estimates of all model parameters. In the presence of missing data, the resulting EM algorithm becomes computationally intractable, but a simplification of the M-step leads to a new procedure that is shown to be an expectation/conditional maximization (ECM) algorithm under exogeneity of the covariates. Simulation studies reveal that the approach appears to be relatively robust to moderate percentages of missing data, even with fewer subjects and time points, and that estimates are generally consistent with the asymptotics. The methodology is applied to a dataset from a published panel study of elderly patients with impaired respiratory function. Forecasted values thus obtained may serve as an "early-warning" mechanism for identifying patients whose lung function is nearing critical levels. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 202-216 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.746066 File-URL: http://hdl.handle.net/10.1080/01621459.2012.746066 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:202-216 Template-Type: ReDIF-Article 1.0 Author-Name: Jane Paik Kim Author-X-Name-First: Jane Paik Author-X-Name-Last: Kim Author-Name: Wenbin Lu Author-X-Name-First: Wenbin Author-X-Name-Last: Lu Author-Name: Tony Sit Author-X-Name-First: Tony Author-X-Name-Last: Sit Author-Name: Zhiliang Ying Author-X-Name-First: Zhiliang Author-X-Name-Last: Ying Title: A Unified Approach to Semiparametric Transformation Models Under General Biased Sampling Schemes Abstract: We propose a unified estimation method for semiparametric linear transformation models under general biased sampling schemes. The new estimator is obtained from a set of counting process-based unbiased estimating equations, developed through introducing a general weighting scheme that offsets the sampling bias. The usual asymptotic properties, including consistency and asymptotic normality, are established under suitable regularity conditions. A closed-form formula is derived for the limiting variance and the plug-in estimator is shown to be consistent. We demonstrate the unified approach through the special cases of left truncation, length bias, the case-cohort design, and variants thereof. Simulation studies and applications to real datasets are presented. 
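A minimal numerical illustration of the weighting idea that offsets sampling bias: under pure length bias, a unit of size x enters the sample with probability proportional to x, and weighting each sampled unit by 1/x recovers population quantities. The Python sketch below is only a toy special case of the general counting-process weighting scheme described above, with all distributions and sizes chosen for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    pop = rng.gamma(shape=2.0, scale=1.0, size=200000)  # target population, mean 2.0

    # Length-biased sample: inclusion probability proportional to x
    probs = pop / pop.sum()
    sample = pop[rng.choice(pop.size, size=5000, p=probs)]

    naive = sample.mean()                     # biased upward by the size-weighted draw
    w = 1.0 / sample                          # weights that offset the sampling bias
    weighted = np.sum(w * sample) / np.sum(w) # harmonic-mean-type correction

    print(naive, weighted)  # naive is near 3.0 for this gamma; weighted is near 2.0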
Journal: Journal of the American Statistical Association Pages: 217-227 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.746073 File-URL: http://hdl.handle.net/10.1080/01621459.2012.746073 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:217-227 Template-Type: ReDIF-Article 1.0 Author-Name: Qian Jiang Author-X-Name-First: Qian Author-X-Name-Last: Jiang Author-Name: Hansheng Wang Author-X-Name-First: Hansheng Author-X-Name-Last: Wang Author-Name: Yingcun Xia Author-X-Name-First: Yingcun Author-X-Name-Last: Xia Author-Name: Guohua Jiang Author-X-Name-First: Guohua Author-X-Name-Last: Jiang Title: On a Principal Varying Coefficient Model Abstract: We propose a novel varying coefficient model (VCM), called principal varying coefficient model (PVCM), by characterizing the varying coefficients through linear combinations of a few principal functions. Compared with the conventional VCM, PVCM reduces the actual number of nonparametric functions and thus has better estimation efficiency. Compared with the semivarying coefficient model (SVCM), PVCM is more flexible but has the same estimation efficiency when the number of principal functions in PVCM and the number of varying coefficients in SVCM are the same. Model estimation and identification are investigated, and the better estimation efficiency is justified theoretically. By incorporating an L1 penalty into the estimation, variables in the linear combinations can be selected automatically, and hence the estimation efficiency can be further improved. Numerical experiments suggest that the model together with the estimation method is useful even when the number of covariates is large. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 228-236 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.736904 File-URL: http://hdl.handle.net/10.1080/01621459.2012.736904 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:228-236 Template-Type: ReDIF-Article 1.0 Author-Name: Zhenghui Feng Author-X-Name-First: Zhenghui Author-X-Name-Last: Feng Author-Name: Xuerong Meggie Wen Author-X-Name-First: Xuerong Meggie Author-X-Name-Last: Wen Author-Name: Zhou Yu Author-X-Name-First: Zhou Author-X-Name-Last: Yu Author-Name: Lixing Zhu Author-X-Name-First: Lixing Author-X-Name-Last: Zhu Title: On Partial Sufficient Dimension Reduction With Applications to Partially Linear Multi-Index Models Abstract: Partial dimension reduction is a general method to seek informative convex combinations of predictors of primary interest, which includes dimension reduction as its special case when the predictors in the remaining part are constants. In this article, we propose a novel method to conduct partial dimension reduction estimation for predictors of primary interest without assuming that the remaining predictors are categorical. To this end, we first take the dichotomization step such that any existing approach for partial dimension reduction estimation can be employed. Then we take the expectation step to integrate over all the dichotomic predictors to identify the partial central subspace. As an example, we use the partially linear multi-index model to illustrate its applications for semiparametric modeling. Simulations and real data examples are given to illustrate our methodology.
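For readers unfamiliar with the machinery, sliced inverse regression (SIR) is the canonical estimator underlying this kind of sufficient dimension reduction; the partial variant applies it conditionally on the dichotomized remaining predictors and then averages. The sketch below implements plain SIR only, as an assumed building block rather than the authors' procedure; data and dimensions are simulated for illustration.

    import numpy as np

    def sir_directions(X, y, n_slices=10, n_dirs=1):
        """Sliced inverse regression: leading eigenvectors of Cov(E[Z | slice])
        in the whitened scale, mapped back to the original predictor scale."""
        n, p = X.shape
        mu = X.mean(axis=0)
        C = np.linalg.cholesky(np.cov(X, rowvar=False))
        Z = np.linalg.solve(C, (X - mu).T).T            # whitened predictors
        order = np.argsort(y)
        M = np.zeros((p, p))
        for chunk in np.array_split(order, n_slices):   # slice on the response
            m = Z[chunk].mean(axis=0)
            M += (len(chunk) / n) * np.outer(m, m)
        vals, vecs = np.linalg.eigh(M)
        eta = vecs[:, ::-1][:, :n_dirs]                 # leading eigenvectors
        beta = np.linalg.solve(C.T, eta)                # back to original scale
        return beta / np.linalg.norm(beta, axis=0)

    # Toy check: y depends on X only through the single index X @ b
    rng = np.random.default_rng(2)
    X = rng.normal(size=(2000, 5))
    b = np.array([1.0, -1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)
    y = (X @ b) ** 3 + rng.normal(scale=0.2, size=2000)
    print(sir_directions(X, y).ravel())                 # approximately +/- b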
Journal: Journal of the American Statistical Association Pages: 237-246 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.746065 File-URL: http://hdl.handle.net/10.1080/01621459.2012.746065 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:237-246 Template-Type: ReDIF-Article 1.0 Author-Name: Wei Lin Author-X-Name-First: Wei Author-X-Name-Last: Lin Author-Name: Jinchi Lv Author-X-Name-First: Jinchi Author-X-Name-Last: Lv Title: High-Dimensional Sparse Additive Hazards Regression Abstract: High-dimensional sparse modeling with censored survival data is of great practical importance, as exemplified by modern applications in high-throughput genomic data analysis and credit risk analysis. In this article, we propose a class of regularization methods for simultaneous variable selection and estimation in the additive hazards model, by combining the nonconcave penalized likelihood approach and the pseudoscore method. In a high-dimensional setting where the dimensionality can grow fast, polynomially or nonpolynomially, with the sample size, we establish the weak oracle property and oracle property under mild, interpretable conditions, thus providing strong performance guarantees for the proposed methodology. Moreover, we show that the regularity conditions required by the L1 method are substantially relaxed by a certain class of sparsity-inducing concave penalties. As a result, concave penalties such as the smoothly clipped absolute deviation, minimax concave penalty, and smooth integration of counting and absolute deviation can significantly improve on the L1 method and yield sparser models with better prediction performance. We present a coordinate descent algorithm for efficient implementation and rigorously investigate its convergence properties. The practical use and effectiveness of the proposed methods are demonstrated by simulation studies and a real data example. Journal: Journal of the American Statistical Association Pages: 247-264 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.746068 File-URL: http://hdl.handle.net/10.1080/01621459.2012.746068 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:247-264 Template-Type: ReDIF-Article 1.0 Author-Name: Tony Cai Author-X-Name-First: Tony Author-X-Name-Last: Cai Author-Name: Weidong Liu Author-X-Name-First: Weidong Author-X-Name-Last: Liu Author-Name: Yin Xia Author-X-Name-First: Yin Author-X-Name-Last: Xia Title: Two-Sample Covariance Matrix Testing and Support Recovery in High-Dimensional and Sparse Settings Abstract: In the high-dimensional setting, this article considers three interrelated problems: (a) testing the equality of two covariance matrices Σ1 and Σ2; (b) recovering the support of Σ1 - Σ2; and (c) testing the equality of Σ1 and Σ2 row by row. We propose a new test for testing the hypothesis H0: Σ1 = Σ2 and investigate its theoretical and numerical properties. The limiting null distribution of the test statistic is derived and the power of the test is studied. The test is shown to enjoy certain optimality and to be especially powerful against sparse alternatives. The simulation results show that the test significantly outperforms the existing methods both in terms of size and power. Analysis of a prostate cancer dataset is carried out to demonstrate the application of the testing procedures.
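A sketch of a max-type statistic of the kind studied here: standardize each entrywise difference of sample covariances by an estimate of its variance, take the maximum, and calibrate against an extreme-value limit. The Python code below is an illustrative reconstruction under loose assumptions about the estimator details and is not the authors' software.

    import numpy as np

    def max_cov_test(X1, X2):
        """Max standardized squared difference of sample covariances,
        with an extreme-value calibration (illustrative reconstruction)."""
        def cov_and_var(X):
            n = X.shape[0]
            Xc = X - X.mean(axis=0)
            S = Xc.T @ Xc / n                          # sample covariances
            # theta[i, j]: variance estimate of the (i, j) sample covariance
            theta = ((Xc[:, :, None] * Xc[:, None, :] - S) ** 2).mean(axis=0)
            return S, theta, n
        S1, th1, n1 = cov_and_var(X1)
        S2, th2, n2 = cov_and_var(X2)
        M = ((S1 - S2) ** 2 / (th1 / n1 + th2 / n2)).max()
        p = X1.shape[1]
        t = M - 4 * np.log(p) + np.log(np.log(p))      # extreme-value normalization
        pval = 1 - np.exp(-np.exp(-t / 2) / np.sqrt(8 * np.pi))
        return M, pval

    rng = np.random.default_rng(3)
    print(max_cov_test(rng.normal(size=(200, 50)), rng.normal(size=(200, 50))))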
When the null hypothesis of equal covariance matrices is rejected, it is often of significant interest to further investigate how they differ from each other. Motivated by applications in genomics, we also consider recovering the support of Σ1 - Σ2 and testing the equality of the two covariance matrices row by row. New procedures are introduced and their properties are studied. Applications to gene selection are also discussed. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 265-277 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.758041 File-URL: http://hdl.handle.net/10.1080/01621459.2012.758041 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:265-277 Template-Type: ReDIF-Article 1.0 Author-Name: Jing Lei Author-X-Name-First: Jing Author-X-Name-Last: Lei Author-Name: James Robins Author-X-Name-First: James Author-X-Name-Last: Robins Author-Name: Larry Wasserman Author-X-Name-First: Larry Author-X-Name-Last: Wasserman Title: Distribution-Free Prediction Sets Abstract: This article introduces a new approach to prediction by bringing together two different nonparametric ideas: distribution-free inference and nonparametric smoothing. Specifically, we consider the problem of constructing nonparametric tolerance/prediction sets. We start from the general conformal prediction approach, and we use a kernel density estimator as a measure of agreement between a sample point and the underlying distribution. The resulting prediction set is shown to be closely related to plug-in density level sets with carefully chosen cutoff values. Under standard smoothness conditions, we get an asymptotic efficiency result that is near optimal for a wide range of function classes. But the coverage is guaranteed whether or not the smoothness conditions hold and regardless of the sample size. The performance of our method is investigated through simulation studies and illustrated in a real data example. Journal: Journal of the American Statistical Association Pages: 278-287 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.751873 File-URL: http://hdl.handle.net/10.1080/01621459.2012.751873 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:278-287 Template-Type: ReDIF-Article 1.0 Author-Name: Fei Fu Author-X-Name-First: Fei Author-X-Name-Last: Fu Author-Name: Qing Zhou Author-X-Name-First: Qing Author-X-Name-Last: Zhou Title: Learning Sparse Causal Gaussian Networks With Experimental Intervention: Regularization and Coordinate Descent Abstract: Causal networks are graphically represented by directed acyclic graphs (DAGs). Learning causal networks from data is a challenging problem due to the size of the space of DAGs, the acyclicity constraint placed on the graphical structures, and the presence of equivalence classes. In this article, we develop an L1-penalized likelihood approach to estimate the structure of causal Gaussian networks. A blockwise coordinate descent algorithm, which takes advantage of the acyclicity constraint, is proposed for seeking a local maximizer of the penalized likelihood. We establish that model selection consistency for causal Gaussian networks can be achieved with the adaptive lasso penalty and sufficient experimental interventions.
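The coordinate descent engine behind such penalized-likelihood approaches iterates a one-dimensional soft-thresholding update over coordinates. A generic lasso version, without the acyclicity constraint or blockwise structure of the DAG problem and with simulated data, is sketched below as a hedged illustration of the update rule only.

    import numpy as np

    def soft_threshold(z, g):
        return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

    def lasso_cd(X, y, lam, n_iter=200):
        """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
        n, p = X.shape
        b = np.zeros(p)
        r = y - X @ b                              # running residual
        col_ss = (X ** 2).sum(axis=0) / n
        for _ in range(n_iter):
            for j in range(p):
                r += X[:, j] * b[j]                # remove coordinate j from the fit
                z = X[:, j] @ r / n
                b[j] = soft_threshold(z, lam) / col_ss[j]
                r -= X[:, j] * b[j]                # add the updated coordinate back
        return b

    rng = np.random.default_rng(4)
    X = rng.normal(size=(500, 20))
    beta = np.zeros(20); beta[:3] = [3.0, -2.0, 1.5]
    y = X @ beta + rng.normal(size=500)
    print(np.round(lasso_cd(X, y, lam=0.1), 2))    # first three coordinates dominate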
Simulation and real data examples are used to demonstrate the effectiveness of our method. In particular, our method shows satisfactory performance for DAGs with 200 nodes, which have about 20,000 free parameters. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 288-300 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.754359 File-URL: http://hdl.handle.net/10.1080/01621459.2012.754359 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:288-300 Template-Type: ReDIF-Article 1.0 Author-Name: Ryan Martin Author-X-Name-First: Ryan Author-X-Name-Last: Martin Author-Name: Chuanhai Liu Author-X-Name-First: Chuanhai Author-X-Name-Last: Liu Title: Inferential Models: A Framework for Prior-Free Posterior Probabilistic Inference Abstract: Posterior probabilistic statistical inference without priors is an important but so far elusive goal. Fisher's fiducial inference, Dempster--Shafer theory of belief functions, and Bayesian inference with default priors are attempts to achieve this goal but, to date, none has given a completely satisfactory picture. This article presents a new framework for probabilistic inference, based on inferential models (IMs), which not only provides data-dependent probabilistic measures of uncertainty about the unknown parameter, but also does so with an automatic long-run frequency-calibration property. The key to this new approach is the identification of an unobservable auxiliary variable associated with observable data and unknown parameter, and the prediction of this auxiliary variable with a random set before conditioning on data. Here we present a three-step IM construction, and prove a frequency-calibration property of the IM's belief function under mild conditions. A corresponding optimality theory is developed, which helps to resolve the nonuniqueness issue. Several examples are presented to illustrate this new approach. Journal: Journal of the American Statistical Association Pages: 301-313 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.747960 File-URL: http://hdl.handle.net/10.1080/01621459.2012.747960 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:301-313 Template-Type: ReDIF-Article 1.0 Author-Name: Christoph Rothe Author-X-Name-First: Christoph Author-X-Name-Last: Rothe Author-Name: Dominik Wied Author-X-Name-First: Dominik Author-X-Name-Last: Wied Title: Misspecification Testing in a Class of Conditional Distributional Models Abstract: We propose a specification test for a wide range of parametric models for the conditional distribution function of an outcome variable given a vector of covariates. The test is based on the Cramer--von Mises distance between an unrestricted estimate of the joint distribution function of the data and a restricted estimate that imposes the structure implied by the model. The procedure is straightforward to implement, is consistent against fixed alternatives, has nontrivial power against local deviations of order n^{-1/2} from the null hypothesis, and does not require the choice of smoothing parameters. In an empirical application, we use our test to study the validity of various models for the conditional distribution of wages in the United States.
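The distance just described can be reconstructed in a few lines: evaluate the empirical joint CDF of (y, x) at the data points and compare it with a restricted estimate that averages the model-implied conditional CDF over the empirical covariate distribution. The sketch below assumes a homoskedastic Gaussian linear model as a hypothetical null and omits the bootstrap calibration of critical values.

    import numpy as np
    from scipy import stats

    def cvm_distance(y, X, beta, sigma):
        """Cramer-von Mises distance between the empirical joint CDF of (y, x)
        and the CDF implied by the model y | x ~ N(x @ beta, sigma^2)."""
        # le_x[i, j]: observation i is componentwise <= observation j in x
        le_x = np.all(X[:, None, :] <= X[None, :, :], axis=2)
        F_emp = np.mean(le_x & (y[:, None] <= y[None, :]), axis=0)
        # Restricted estimate: average model conditional CDF over {x_i <= x_j}
        F_mod = np.mean(
            le_x * stats.norm.cdf((y[None, :] - (X @ beta)[:, None]) / sigma),
            axis=0)
        return np.sum((F_emp - F_mod) ** 2)

    rng = np.random.default_rng(5)
    X = rng.normal(size=(300, 2))
    y = X @ np.array([1.0, -0.5]) + rng.normal(size=300)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma_hat = np.std(y - X @ beta_hat)
    print(cvm_distance(y, X, beta_hat, sigma_hat))  # small under a correct model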
Journal: Journal of the American Statistical Association Pages: 314-324 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.736903 File-URL: http://hdl.handle.net/10.1080/01621459.2012.736903 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:314-324 Template-Type: ReDIF-Article 1.0 Author-Name: Faming Liang Author-X-Name-First: Faming Author-X-Name-Last: Liang Author-Name: Yichen Cheng Author-X-Name-First: Yichen Author-X-Name-Last: Cheng Author-Name: Qifan Song Author-X-Name-First: Qifan Author-X-Name-Last: Song Author-Name: Jincheol Park Author-X-Name-First: Jincheol Author-X-Name-Last: Park Author-Name: Ping Yang Author-X-Name-First: Ping Author-X-Name-Last: Yang Title: A Resampling-Based Stochastic Approximation Method for Analysis of Large Geostatistical Data Abstract: The Gaussian geostatistical model has been widely used in modeling of spatial data. However, it is challenging to computationally implement this method because it requires the inversion of a large covariance matrix, particularly when there is a large number of observations. This article proposes a resampling-based stochastic approximation method to address this challenge. At each iteration of the proposed method, a small subsample is drawn from the full dataset, and then the current estimate of the parameters is updated accordingly under the framework of stochastic approximation. Since the proposed method makes use of only a small proportion of the data at each iteration, it avoids inverting large covariance matrices and thus is scalable to large datasets. The proposed method also leads to a general parameter estimation approach, maximum mean log-likelihood estimation, which includes the popular maximum (log)-likelihood estimation (MLE) approach as a special case and is expected to play an important role in analyzing large datasets. Under mild conditions, it is shown that the estimator resulting from the proposed method converges in probability to a set of parameter values of equivalent Gaussian probability measures, and that the estimator is asymptotically normally distributed. To the best of the authors' knowledge, the present study is the first one on asymptotic normality under infill asymptotics for general covariance functions. The proposed method is illustrated with large datasets, both simulated and real. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 325-339 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.746061 File-URL: http://hdl.handle.net/10.1080/01621459.2012.746061 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:325-339 Template-Type: ReDIF-Article 1.0 Author-Name: G. García-Donato Author-X-Name-First: G. Author-X-Name-Last: García-Donato Author-Name: M. A. Martínez-Beneito Author-X-Name-First: M. A. Author-X-Name-Last: Martínez-Beneito Title: On Sampling Strategies in Bayesian Variable Selection Problems With Large Model Spaces Abstract: One important aspect of Bayesian model selection is how to deal with huge model spaces, since the exhaustive enumeration of all the models entertained is not feasible and inferences have to be based on the very small proportion of models visited.
This is the case for the variable selection problem with a moderately large number of possible explanatory variables considered in this article. We review some of the strategies proposed in the literature, from a theoretical point of view using arguments of sampling theory and in practical terms using several examples with a known answer. All our results seem to indicate that sampling methods with frequency-based estimators outperform searching methods with renormalized estimators. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 340-352 Issue: 501 Volume: 108 Year: 2013 Month: 3 X-DOI: 10.1080/01621459.2012.742443 File-URL: http://hdl.handle.net/10.1080/01621459.2012.742443 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:340-352 Template-Type: ReDIF-Article 1.0 Author-Name: Roderick J. Little Author-X-Name-First: Roderick J. Author-X-Name-Last: Little Title: In Praise of Simplicity not Mathematistry! Ten Simple Powerful Ideas for the Statistical Scientist Abstract: Ronald Fisher was by all accounts a first-rate mathematician, but he saw himself as a scientist, not a mathematician, and he railed against what George Box called (in his Fisher lecture) "mathematistry." Mathematics is the indispensable foundation of statistics, but for me the real excitement and value of our subject lies in its application to other disciplines. We should not view statistics as another branch of mathematics and favor mathematical complexity over clarifying, formulating, and solving real-world problems. Valuing simplicity, I describe 10 simple and powerful ideas that have influenced my thinking about statistics, in my areas of research interest: missing data, causal inference, survey sampling, and statistical modeling in general. The overarching theme is that statistics is a missing data problem and the goal is to predict unknowns with appropriate measures of uncertainty. Journal: Journal of the American Statistical Association Pages: 359-369 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.787932 File-URL: http://hdl.handle.net/10.1080/01621459.2013.787932 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:359-369 Template-Type: ReDIF-Article 1.0 Author-Name: Edward Ip Author-X-Name-First: Edward Author-X-Name-Last: Ip Author-Name: Qiang Zhang Author-X-Name-First: Qiang Author-X-Name-Last: Zhang Author-Name: Jack Rejeski Author-X-Name-First: Jack Author-X-Name-Last: Rejeski Author-Name: Tammy Harris Author-X-Name-First: Tammy Author-X-Name-Last: Harris Author-Name: Stephen Kritchevsky Author-X-Name-First: Stephen Author-X-Name-Last: Kritchevsky Title: Partially Ordered Mixed Hidden Markov Model for the Disablement Process of Older Adults Abstract: At both the individual and societal levels, the health and economic burden of disability in older adults is enormous in developed countries, including the U.S. Recent studies have revealed that the disablement process in older adults often comprises episodic periods of impaired functioning and periods that are relatively free of disability, amid a secular and natural trend of decline in functioning. 
Rather than as an irreversible, progressive event analogous to a chronic disease, disability is better conceptualized and mathematically modeled as states that do not necessarily follow a strict linear order of good to bad. Statistical tools, including Markov models, which allow bidirectional transition between states, and random effects models, which allow an individual-specific rate of secular decline, are pertinent. In this article, we propose a mixed effects, multivariate, hidden Markov model to handle partially ordered disability states. The model generalizes the continuation ratio model for ordinal data in the generalized linear model literature and provides a formal framework for testing the effects of risk factors and/or an intervention on the transitions between different disability states. Under a generalization of the proportional odds ratio assumption, the proposed model circumvents the problem of a potentially large number of parameters when the number of states and the number of covariates are substantial. We describe a maximum likelihood method for estimating the partially ordered, mixed effects model and show how the model can be applied to a longitudinal dataset that consists of N = 2903 older adults followed for 10 years in the Health, Aging and Body Composition Study. We further statistically test the effects of various risk factors on the probabilities of transition into various severe disability states. The result can be used to inform geriatric and public health science researchers who study the disablement process. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 370-384 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.770307 File-URL: http://hdl.handle.net/10.1080/01621459.2013.770307 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:370-384 Template-Type: ReDIF-Article 1.0 Author-Name: Mauricio Sadinle Author-X-Name-First: Mauricio Author-X-Name-Last: Sadinle Author-Name: Stephen E. Fienberg Author-X-Name-First: Stephen E. Author-X-Name-Last: Fienberg Title: A Generalized Fellegi--Sunter Framework for Multiple Record Linkage With Application to Homicide Record Systems Abstract: We present a probabilistic method for linking multiple datafiles. This task is not trivial in the absence of unique identifiers for the individuals recorded. This is a common scenario when linking census data to coverage measurement surveys for census coverage evaluation, and in general when multiple record systems need to be integrated for posterior analysis. Our method generalizes the Fellegi--Sunter theory for linking records from two datafiles and its modern implementations. The goal of multiple record linkage is to classify the record K-tuples coming from K datafiles according to the different matching patterns. Our method incorporates the transitivity of agreement in the computation of the data used to model matching probabilities. We use a mixture model to fit matching probabilities via maximum likelihood using the Expectation--Maximization algorithm. We present a method to decide the membership of record K-tuples in the subsets of matching patterns and we prove its optimality. We apply our method to the integration of the three Colombian homicide record systems and perform a simulation study to explore the performance of the method under measurement error and different scenarios.
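As background for the mixture-model fit that this abstract describes, a minimal sketch of the classical two-file Fellegi--Sunter EM under conditional independence follows; the three comparison fields, the toy agreement patterns, and the starting values are illustrative assumptions, not the authors' generalized K-file method.

```python
import numpy as np

# Toy agreement patterns: each row compares one record pair on three
# fields (1 = field agrees, 0 = disagrees). Data are purely illustrative.
gamma = np.array([[1, 1, 1], [1, 1, 0], [0, 0, 0],
                  [1, 0, 0], [1, 1, 1], [0, 1, 0]] * 50)

p = 0.5              # P(pair is a match), starting value
m = np.full(3, 0.8)  # P(field agrees | match), starting values
u = np.full(3, 0.2)  # P(field agrees | non-match), starting values

for _ in range(200):
    # E-step: posterior match probability for each pair, assuming
    # conditional independence of the fields given match status.
    like_m = np.prod(m ** gamma * (1 - m) ** (1 - gamma), axis=1)
    like_u = np.prod(u ** gamma * (1 - u) ** (1 - gamma), axis=1)
    w = p * like_m / (p * like_m + (1 - p) * like_u)
    # M-step: update the mixture weight and per-field agreement rates.
    p = w.mean()
    m = (w[:, None] * gamma).sum(axis=0) / w.sum()
    u = ((1 - w)[:, None] * gamma).sum(axis=0) / (1 - w).sum()

print(p, m, u)
```

The posterior match weights w computed in the E-step are what a linkage rule would then threshold to classify pairs; the article's contribution extends this machinery to K-tuples with transitivity constraints.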
The proposed method works well and opens new directions for future research. Journal: Journal of the American Statistical Association Pages: 385-397 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2012.757231 File-URL: http://hdl.handle.net/10.1080/01621459.2012.757231 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:385-397 Template-Type: ReDIF-Article 1.0 Author-Name: Natallia Katenka Author-X-Name-First: Natallia Author-X-Name-Last: Katenka Author-Name: Elizaveta Levina Author-X-Name-First: Elizaveta Author-X-Name-Last: Levina Author-Name: George Michailidis Author-X-Name-First: George Author-X-Name-Last: Michailidis Title: Tracking Multiple Targets Using Binary Decisions From Wireless Sensor Networks Abstract: This article introduces a framework for tracking multiple targets over time using binary decisions collected by a wireless sensor network, and applies the methodology to two case studies: an experiment involving tracking people, and a dataset adapted from a project tracking zebras in Kenya. The tracking approach is based on a penalized maximum likelihood framework, and allows for sensor failures, targets appearing and disappearing over time, and complex intersecting target trajectories. We show that binary decisions about the presence/absence of a target in a sensor's neighborhood, corrected locally by a method known as local vote decision fusion, provide the most robust performance in noisy environments and give good tracking results in applications. Journal: Journal of the American Statistical Association Pages: 398-410 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.770284 File-URL: http://hdl.handle.net/10.1080/01621459.2013.770284 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:398-410 Template-Type: ReDIF-Article 1.0 Author-Name: Avishek Chakraborty Author-X-Name-First: Avishek Author-X-Name-Last: Chakraborty Author-Name: Bani K. Mallick Author-X-Name-First: Bani K. Author-X-Name-Last: Mallick Author-Name: Ryan G. Mcclarren Author-X-Name-First: Ryan G. Author-X-Name-Last: Mcclarren Author-Name: Carolyn C. Kuranz Author-X-Name-First: Carolyn C. Author-X-Name-Last: Kuranz Author-Name: Derek Bingham Author-X-Name-First: Derek Author-X-Name-Last: Bingham Author-Name: Michael J. Grosskopf Author-X-Name-First: Michael J. Author-X-Name-Last: Grosskopf Author-Name: Erica M. Rutter Author-X-Name-First: Erica M. Author-X-Name-Last: Rutter Author-Name: Hayes F. Stripling Author-X-Name-First: Hayes F. Author-X-Name-Last: Stripling Author-Name: R. Paul Drake Author-X-Name-First: R. Paul Author-X-Name-Last: Drake Title: Spline-Based Emulators for Radiative Shock Experiments With Measurement Error Abstract: Radiation hydrodynamics and radiative shocks are of fundamental interest in high-energy-density physics research due to their importance in understanding astrophysical phenomena such as supernovae. In the laboratory, experiments can produce shocks with fundamentally similar physics on reduced scales. However, the cost and time constraints of the experiment necessitate the use of a computer algorithm to generate a reasonable number of outputs for making valid inference. We focus on modeling emulators that can efficiently assimilate these two sources of information, accounting for their intrinsic differences.
The goal is to learn how to predict the breakout time of the shock given the information on associated parameters such as pressure and energy. Under the framework of the Kennedy--O'Hagan model, we introduce an emulator based on adaptive splines. Depending on whether one prefers an interpolator of the computer code output or a computationally fast model, a couple of different variants are proposed. These variants are shown to perform better than the conventional Gaussian-process-based emulator and a few other choices of nonstationary models. For the shock experiment dataset, a number of features related to computer model validation, such as the use of an interpolator, the necessity of a discrepancy function, and the handling of experimental heterogeneity, are discussed, implemented, and validated for the current dataset. In addition to the typical Gaussian measurement error for real data, we consider alternative specifications suitable for incorporating noninformativeness into the error distributions, more in keeping with the current experiment. Comparative diagnostics, to highlight the effect of the measurement error model on predictive uncertainty, are also presented. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 411-428 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.770688 File-URL: http://hdl.handle.net/10.1080/01621459.2013.770688 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:411-428 Template-Type: ReDIF-Article 1.0 Author-Name: Curtis B. Storlie Author-X-Name-First: Curtis B. Author-X-Name-Last: Storlie Author-Name: Sarah E. Michalak Author-X-Name-First: Sarah E. Author-X-Name-Last: Michalak Author-Name: Heather M. Quinn Author-X-Name-First: Heather M. Author-X-Name-Last: Quinn Author-Name: Andrew J. Dubois Author-X-Name-First: Andrew J. Author-X-Name-Last: Dubois Author-Name: Steven A. Wender Author-X-Name-First: Steven A. Author-X-Name-Last: Wender Author-Name: David H. Dubois Author-X-Name-First: David H. Author-X-Name-Last: Dubois Title: A Bayesian Reliability Analysis of Neutron-Induced Errors in High Performance Computing Hardware Abstract: A soft error is an undesired change in an electronic device's state, for example, a bit flip in computer memory, that does not permanently affect its functionality. In microprocessor systems, neutron-induced soft errors can cause crashes and silent data corruption (SDC). SDC occurs when a soft error produces a computational result that is incorrect, without the system issuing a warning or error message. Hence, neutron-induced soft errors are a major concern for high performance computing platforms that perform scientific computation. Through accelerated neutron beam testing of hardware in its field configuration, the frequencies of failures (crashes) and of SDCs in hardware from the Roadrunner platform, the first Petaflop supercomputer, are estimated. The impact of key factors on field performance is investigated and estimates of field reliability are provided. Finally, a novel statistical approach for the analysis of interval-censored survival data with mixed effects and uncertainty in the interval endpoints, key features of the experimental data, is presented. Supplementary materials for this article are available online.
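As orientation for the interval-censoring feature just described, the textbook likelihood for failure times $T_i$ known only to lie between inspection times $L_i \le U_i$ takes the form below; this standard display is included for context only, and the authors' model further incorporates mixed effects and uncertainty in the interval endpoints themselves.

\[
L(\theta) \;=\; \prod_{i=1}^{n} \Pr_{\theta}\{L_i \le T_i \le U_i\}
\;=\; \prod_{i=1}^{n} \bigl\{ S(L_i;\theta) - S(U_i;\theta) \bigr\},
\]

where $S(\cdot;\theta)$ denotes the survival function of the failure time.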
Journal: Journal of the American Statistical Association Pages: 429-440 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.770694 File-URL: http://hdl.handle.net/10.1080/01621459.2013.770694 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:429-440 Template-Type: ReDIF-Article 1.0 Author-Name: Monica Costa Dias Author-X-Name-First: Monica Costa Author-X-Name-Last: Dias Author-Name: Hidehiko Ichimura Author-X-Name-First: Hidehiko Author-X-Name-Last: Ichimura Author-Name: Gerard J. van den Berg Author-X-Name-First: Gerard J. Author-X-Name-Last: van den Berg Title: Treatment Evaluation With Selective Participation and Ineligibles Abstract: Matching methods for treatment evaluation based on a conditional independence assumption do not balance selective unobserved differences between treated and nontreated. We derive a simple correction term if there is an instrument that shifts the treatment probability to zero in specific cases. Policies with eligibility restrictions, where treatment is impossible if some variable exceeds a certain value, provide a natural application. In an empirical analysis, we exploit the age eligibility restriction in the Swedish Youth Practice subsidized work program for the young unemployed, where compliance is imperfect among the young. Adjusting the matching estimator for selectivity shifts the results toward subsidized work being detrimental to moving individuals into employment. Journal: Journal of the American Statistical Association Pages: 441-455 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.795447 File-URL: http://hdl.handle.net/10.1080/01621459.2013.795447 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:441-455 Template-Type: ReDIF-Article 1.0 Author-Name: David A. Friedenberg Author-X-Name-First: David A. Author-X-Name-Last: Friedenberg Author-Name: Christopher R. Genovese Author-X-Name-First: Christopher R. Author-X-Name-Last: Genovese Title: Straight to the Source: Detecting Aggregate Objects in Astronomical Images With Proper Error Control Abstract: The next generation of telescopes, coming online in the next decade, will acquire terabytes of image data each night. Collectively, these large images will contain billions of interesting objects, which astronomers call sources. One critical task for astronomers is to construct from the image data a detailed source catalog that gives the sky coordinates and other properties of all detected sources. The source catalog is the primary data product produced by most telescopes and serves as an important input for studies that build and test new astrophysical theories. To construct an accurate catalog, the sources must first be detected in the image. A variety of effective source detection algorithms exist in the astronomical literature, but few, if any, provide rigorous statistical control of error rates. A variety of multiple testing procedures exist in the statistical literature that can provide rigorous error control over pixelwise errors, but these do not provide control over errors at the level of sources, which is what astronomers need. In this article, we propose a technique that is effective at source detection while providing rigorous control on sourcewise error rates. We demonstrate our approach with data from the Chandra X-ray Observatory Satellite.
Our method is competitive with existing astronomical methods, even finding two new sources that were missed by previous studies, while providing stronger performance guarantees and avoiding the costly follow-up studies that are commonly required with current techniques. Journal: Journal of the American Statistical Association Pages: 456-468 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.779829 File-URL: http://hdl.handle.net/10.1080/01621459.2013.779829 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:456-468 Template-Type: ReDIF-Article 1.0 Author-Name: Tyler J. Vanderweele Author-X-Name-First: Tyler J. Author-X-Name-Last: Vanderweele Author-Name: Guanglei Hong Author-X-Name-First: Guanglei Author-X-Name-Last: Hong Author-Name: Stephanie M. Jones Author-X-Name-First: Stephanie M. Author-X-Name-Last: Jones Author-Name: Joshua L. Brown Author-X-Name-First: Joshua L. Author-X-Name-Last: Brown Title: Mediation and Spillover Effects in Group-Randomized Trials: A Case Study of the 4Rs Educational Intervention Abstract: Peer influence and social interactions can give rise to spillover effects in which the exposure of one individual may affect outcomes of other individuals. Even if the intervention under study occurs at the group or cluster level as in group-randomized trials, spillover effects can occur when the mediator of interest is measured at a lower level than the treatment. Evaluators who choose groups rather than individuals as experimental units in a randomized trial often anticipate that the desirable changes in targeted social behaviors will be reinforced through interference among individuals in a group exposed to the same treatment. In an empirical evaluation of the effect of a school-wide intervention on reducing individual students' depressive symptoms, schools in matched pairs were randomly assigned to the 4Rs intervention or the control condition. Class quality was hypothesized as an important mediator assessed at the classroom level. We reason that the quality of one classroom may affect outcomes of children in another classroom because children interact not simply with their classmates but also with those from other classes in the hallways or on the playground. In investigating the role of class quality as a mediator, failure to account for such spillover effects of one classroom on the outcomes of children in other classrooms can potentially result in bias and problems with interpretation. Using a counterfactual conceptualization of direct, indirect, and spillover effects, we provide a framework that can accommodate issues of mediation and spillover effects in group-randomized trials. We show that the total effect can be decomposed into a natural direct effect, a within-classroom mediated effect, and a spillover mediated effect. We give identification conditions for each of the causal effects of interest and provide results on the consequences of ignoring "interference" or "spillover effects" when they are in fact present. Our modeling approach disentangles these effects. The analysis examines whether the 4Rs intervention has an effect on children's depressive symptoms through changing the quality of other classes as well as through changing the quality of a child's own class. Supplementary materials for this article are available online.
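The decomposition stated in this abstract can be written schematically in counterfactual notation, where $Y(t, m^{\mathrm{own}}, m^{\mathrm{oth}})$ denotes the potential outcome under treatment $t$ with the own-classroom and other-classroom mediators held at $m^{\mathrm{own}}$ and $m^{\mathrm{oth}}$, and $M^{\mathrm{own}}(t)$, $M^{\mathrm{oth}}(t)$ denote the mediator values under treatment $t$; the notation and the order in which the mediators are switched are illustrative, and the paper's exact definitions may differ.

\[
\begin{aligned}
\mathrm{TE} ={}& E\bigl[Y\bigl(1, M^{\mathrm{own}}(1), M^{\mathrm{oth}}(1)\bigr) - Y\bigl(0, M^{\mathrm{own}}(0), M^{\mathrm{oth}}(0)\bigr)\bigr] \\
={}& \underbrace{E\bigl[Y\bigl(1, M^{\mathrm{own}}(0), M^{\mathrm{oth}}(0)\bigr) - Y\bigl(0, M^{\mathrm{own}}(0), M^{\mathrm{oth}}(0)\bigr)\bigr]}_{\text{natural direct effect}} \\
&+ \underbrace{E\bigl[Y\bigl(1, M^{\mathrm{own}}(1), M^{\mathrm{oth}}(0)\bigr) - Y\bigl(1, M^{\mathrm{own}}(0), M^{\mathrm{oth}}(0)\bigr)\bigr]}_{\text{within-classroom mediated effect}} \\
&+ \underbrace{E\bigl[Y\bigl(1, M^{\mathrm{own}}(1), M^{\mathrm{oth}}(1)\bigr) - Y\bigl(1, M^{\mathrm{own}}(1), M^{\mathrm{oth}}(0)\bigr)\bigr]}_{\text{spillover mediated effect}}
\end{aligned}
\]

The three terms telescope to the total effect, which is the sense in which the decomposition is exact.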
Journal: Journal of the American Statistical Association Pages: 469-482 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.779832 File-URL: http://hdl.handle.net/10.1080/01621459.2013.779832 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:469-482 Template-Type: ReDIF-Article 1.0 Author-Name: Yueqing Wang Author-X-Name-First: Yueqing Author-X-Name-Last: Wang Author-Name: Xin Jiang Author-X-Name-First: Xin Author-X-Name-Last: Jiang Author-Name: Bin Yu Author-X-Name-First: Bin Author-X-Name-Last: Yu Author-Name: Ming Jiang Author-X-Name-First: Ming Author-X-Name-Last: Jiang Title: A Hierarchical Bayesian Approach for Aerosol Retrieval Using MISR Data Abstract: Atmospheric aerosols can cause serious damage to human health and reduce life expectancy. Using the radiances observed by NASA's Multi-angle Imaging SpectroRadiometer (MISR), the current MISR operational algorithm retrieves aerosol optical depth (AOD) at 17.6 km resolution. A systematic study of aerosols and their impact on public health, especially in highly populated urban areas, requires finer-resolution estimates of AOD's spatial distribution. We embed MISR's operational weighted least squares criterion and its forward calculations for AOD retrievals in a likelihood framework and further expand it into a hierarchical Bayesian model to adapt to a finer spatial resolution of 4.4 km. To take advantage of AOD's spatial smoothness, our method borrows strength from data at neighboring areas by postulating a Gaussian Markov random field prior for AOD. Our model considers AOD and aerosol mixing vectors as continuous variables, whose inference is carried out using Metropolis-within-Gibbs sampling methods. Retrieval uncertainties are quantified by posterior variabilities. We also develop a parallel Markov chain Monte Carlo (MCMC) algorithm to improve computational efficiency. We assess our retrieval performance using ground-based measurements from the AErosol RObotic NETwork (AERONET) and satellite images from Google Earth. Based on case studies in the greater Beijing area, China, we show that 4.4 km resolution can improve both the accuracy and coverage of remotely sensed aerosol retrievals, as well as our understanding of the spatial and seasonal behaviors of aerosols. This is particularly important during high-AOD events, which often indicate severe air pollution. Journal: Journal of the American Statistical Association Pages: 483-493 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.796834 File-URL: http://hdl.handle.net/10.1080/01621459.2013.796834 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:483-493 Template-Type: ReDIF-Article 1.0 Author-Name: Sungduk Kim Author-X-Name-First: Sungduk Author-X-Name-Last: Kim Author-Name: Zhen Chen Author-X-Name-First: Zhen Author-X-Name-Last: Chen Author-Name: Zhiwei Zhang Author-X-Name-First: Zhiwei Author-X-Name-Last: Zhang Author-Name: Bruce G. Simons-Morton Author-X-Name-First: Bruce G. Author-X-Name-Last: Simons-Morton Author-Name: Paul S. Albert Author-X-Name-First: Paul S.
Author-X-Name-Last: Albert Title: Bayesian Hierarchical Poisson Regression Models: An Application to a Driving Study With Kinematic Events Abstract: Although there is evidence that teenagers are at a high risk of crashes in the early months after licensure, the driving behavior of these teenagers is not well understood. The Naturalistic Teenage Driving Study (NTDS) is the first U.S. study to document continuous driving performance of newly licensed teenagers during their first 18 months of licensure. Counts of kinematic events such as the number of rapid accelerations are available for each trip, and their incidence rates represent different aspects of driving behavior. We propose a hierarchical Poisson regression model incorporating overdispersion, heterogeneity, and serial correlation as well as a semiparametric mean structure. Analysis of the NTDS data is carried out with a hierarchical Bayesian framework using reversible jump Markov chain Monte Carlo algorithms to accommodate the flexible mean structure. We show that driving with a passenger and night driving decrease kinematic events, while having risky friends increases these events. Further, the within-subject variation in these events is comparable to the between-subject variation. This methodology will be useful for other intensively collected longitudinal count data, where event rates are low and interest focuses on estimating the mean and variance structure of the process. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 494-503 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.770702 File-URL: http://hdl.handle.net/10.1080/01621459.2013.770702 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:494-503 Template-Type: ReDIF-Article 1.0 Author-Name: Zahra Siddique Author-X-Name-First: Zahra Author-X-Name-Last: Siddique Title: Partially Identified Treatment Effects Under Imperfect Compliance: The Case of Domestic Violence Abstract: The Minneapolis Domestic Violence Experiment (MDVE) is a randomized social experiment with imperfect compliance that has been extremely influential in how police officers respond to misdemeanor domestic violence. This article reexamines data from the MDVE, using the recent literature on partial identification to determine the recidivism associated with a policy of arresting misdemeanor domestic violence suspects rather than not arresting them. Using partially identified bounds on the average treatment effect, I find that arresting rather than not arresting suspects can potentially reduce recidivism by more than two-and-a-half times the corresponding intent-to-treat estimate and more than two times the corresponding local average treatment effect, even when making minimal assumptions on counterfactuals. Journal: Journal of the American Statistical Association Pages: 504-513 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.779836 File-URL: http://hdl.handle.net/10.1080/01621459.2013.779836 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:504-513 Template-Type: ReDIF-Article 1.0 Author-Name: Josue G. Martinez Author-X-Name-First: Josue G. Author-X-Name-Last: Martinez Author-Name: Kirsten M. Bohn Author-X-Name-First: Kirsten M. Author-X-Name-Last: Bohn Author-Name: Raymond J. Carroll Author-X-Name-First: Raymond J.
Author-X-Name-Last: Carroll Author-Name: Jeffrey S. Morris Author-X-Name-First: Jeffrey S. Author-X-Name-Last: Morris Title: A Study of Mexican Free-Tailed Bat Chirp Syllables: Bayesian Functional Mixed Models for Nonstationary Acoustic Time Series Abstract: We describe a new approach to analyze chirp syllables of free-tailed bats from two regions of Texas in which they are predominant: Austin and College Station. Our goal is to characterize any systematic regional differences in the mating chirps and assess whether individual bats have signature chirps. The data are analyzed by modeling spectrograms of the chirps as responses in a Bayesian functional mixed model. Given the variable chirp lengths, we compute the spectrograms on a relative time scale interpretable as the relative chirp position, using a variable window overlap based on chirp length. We use two-dimensional wavelet transforms to capture correlation within the spectrogram in our modeling and obtain adaptive regularization of the estimates and inference for the region-specific spectrograms. Our model includes random effect spectrograms at the bat level to account for correlation among chirps from the same bat and to assess relative variability in chirp spectrograms within and between bats. The modeling of spectrograms using functional mixed models is a general approach for the analysis of replicated nonstationary time series, such as our acoustical signals, to relate aspects of the signals to various predictors, while accounting for between-signal structure. This can be done on raw spectrograms when all signals are of the same length, and using spectrograms defined on a relative time scale for signals of variable length, in settings where defining correspondence across signals based on relative position is sensible. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 514-526 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.793118 File-URL: http://hdl.handle.net/10.1080/01621459.2013.793118 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:514-526 Template-Type: ReDIF-Article 1.0 Author-Name: Lihui Zhao Author-X-Name-First: Lihui Author-X-Name-Last: Zhao Author-Name: Lu Tian Author-X-Name-First: Lu Author-X-Name-Last: Tian Author-Name: Tianxi Cai Author-X-Name-First: Tianxi Author-X-Name-Last: Cai Author-Name: Brian Claggett Author-X-Name-First: Brian Author-X-Name-Last: Claggett Author-Name: L. J. Wei Author-X-Name-First: L. J. Author-X-Name-Last: Wei Title: Effectively Selecting a Target Population for a Future Comparative Study Abstract: When comparing a new treatment with a control in a randomized clinical study, the treatment effect is generally assessed by evaluating a summary measure over a specific study population. The success of the trial heavily depends on the choice of such a population. In this article, we show a systematic, effective way to identify a promising population, for which the new treatment is expected to have a desired benefit, using the data from a current study involving similar comparator treatments. Specifically, using the existing data, we first create a parametric scoring system as a function of multiple baseline covariates to estimate subject-specific treatment differences.
Based on this scoring system, we specify a desired level of treatment difference and obtain a subgroup of patients, defined as those whose estimated scores exceed this threshold. An empirically calibrated threshold-specific treatment difference curve across a range of score values is constructed. The subpopulation of patients satisfying any given level of treatment benefit can then be identified accordingly. To avoid bias due to overoptimism, we use a cross-training-evaluation method for implementing the above two-step procedure. We then show how to select the best scoring system among all competing models. Furthermore, for cases in which only a single prespecified working model is involved, inference procedures are proposed for the average treatment difference over a range of score values using the entire dataset and are justified theoretically and numerically. Finally, the proposals are illustrated with the data from two clinical trials in treating HIV and cardiovascular diseases. Note that if we are not interested in designing a new study for comparing similar treatments, the new procedure can also be quite useful for the management of future patients, so that treatment may be targeted toward those who would receive nontrivial benefits to compensate for the risk or cost of the new treatment. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 527-539 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.770705 File-URL: http://hdl.handle.net/10.1080/01621459.2013.770705 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:527-539 Template-Type: ReDIF-Article 1.0 Author-Name: Hua Zhou Author-X-Name-First: Hua Author-X-Name-Last: Zhou Author-Name: Lexin Li Author-X-Name-First: Lexin Author-X-Name-Last: Li Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Title: Tensor Regression with Applications in Neuroimaging Data Analysis Abstract: Classical regression methods treat covariates as a vector and estimate a corresponding vector of regression coefficients. Modern applications in medical imaging generate covariates of more complex form such as multidimensional arrays (tensors). Traditional statistical and computational methods are proving insufficient for analysis of these high-throughput data due to their ultrahigh dimensionality as well as complex structure. In this article, we propose a new family of tensor regression models that efficiently exploit the special structure of tensor covariates. Under this framework, ultrahigh dimensionality is reduced to a manageable level, resulting in efficient estimation and prediction. A fast and highly scalable estimation algorithm is proposed for maximum likelihood estimation and its associated asymptotic properties are studied. Effectiveness of the new methods is demonstrated on both synthetic and real MRI imaging data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 540-552 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.776499 File-URL: http://hdl.handle.net/10.1080/01621459.2013.776499 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:540-552 Template-Type: ReDIF-Article 1.0 Author-Name: Yuanjia Wang Author-X-Name-First: Yuanjia Author-X-Name-Last: Wang Author-Name: Huaihou Chen Author-X-Name-First: Huaihou Author-X-Name-Last: Chen Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: Christine Mauro Author-X-Name-First: Christine Author-X-Name-Last: Mauro Author-Name: Naihua Duan Author-X-Name-First: Naihua Author-X-Name-Last: Duan Author-Name: M. Katherine Shear Author-X-Name-First: M. Katherine Author-X-Name-Last: Shear Title: Auxiliary Marker-Assisted Classification in the Absence of Class Identifiers Abstract: Constructing classification rules for accurate diagnosis of a disorder is an important goal in medical practice. In many clinical applications, no clinically significant anatomical or physiological deviation exists to identify the gold-standard disease status to inform development of classification algorithms. Despite the absence of perfect disease class identifiers, there are usually one or more disease-informative auxiliary markers along with feature variables that comprise known symptoms. Existing statistical learning approaches do not effectively draw information from auxiliary prognostic markers. We propose a large margin classification method, with particular emphasis on the support vector machine, assisted by available informative markers to classify disease without knowing a subject's true disease status. We view this task as statistical learning in the presence of missing data, and introduce a pseudo-Expectation-Maximization (EM) algorithm for the classification. A major difference between a regular EM algorithm and the algorithm proposed here is that we do not model the distribution of missing data given the observed feature variables either parametrically or semiparametrically. We also propose a sparse variable selection method embedded in the pseudo-EM algorithm. Theoretical examination shows that the proposed classification rule is Fisher consistent, and that under a linear rule, the proposed selection has an oracle variable selection property and the estimated coefficients are asymptotically normal. We apply the methods to build decision rules for including subjects in clinical trials of a new psychiatric disorder and present four applications to data available at the University of California, Irvine Machine Learning Repository. Journal: Journal of the American Statistical Association Pages: 553-565 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.775949 File-URL: http://hdl.handle.net/10.1080/01621459.2013.775949 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:553-565 Template-Type: ReDIF-Article 1.0 Author-Name: Arpita Ghosh Author-X-Name-First: Arpita Author-X-Name-Last: Ghosh Author-Name: Fred A. Wright Author-X-Name-First: Fred A. Author-X-Name-Last: Wright Author-Name: Fei Zou Author-X-Name-First: Fei Author-X-Name-Last: Zou Title: Unified Analysis of Secondary Traits in Case--Control Association Studies Abstract: It has been repeatedly shown that in case--control association studies, analysis of a secondary trait that ignores the original sampling scheme can produce highly biased risk estimates.
Although a number of approaches have been proposed to properly analyze secondary traits, most approaches fail to reproduce the marginal logistic model assumed for the original case--control trait and/or do not allow for an interaction between the secondary trait and the genotype marker on primary disease risk. In addition, the flexible handling of covariates remains challenging. We present a general retrospective likelihood framework to perform association testing for both binary and continuous secondary traits, which respects marginal models and incorporates the interaction term. We provide a computational algorithm, based on a reparameterized approximate profile likelihood, for obtaining the maximum likelihood (ML) estimate and its standard error for the genetic effect on secondary traits, in the presence of covariates. For completeness, we also present an alternative pseudo-likelihood method for handling covariates. We describe extensive simulations to evaluate the performance of the ML estimator in comparison with the pseudo-likelihood and other competing methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 566-576 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.793121 File-URL: http://hdl.handle.net/10.1080/01621459.2013.793121 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:566-576 Template-Type: ReDIF-Article 1.0 Author-Name: Ting Zhang Author-X-Name-First: Ting Author-X-Name-Last: Zhang Title: Clustering High-Dimensional Time Series Based on Parallelism Abstract: This article considers the problem of clustering high-dimensional time series based on trend parallelism. The underlying process is modeled as a nonparametric trend function contaminated by locally stationary errors, a special class of nonstationary processes. For each group where the parallelism holds, I semiparametrically estimate its representative trend function and vertical shifts of group members, and establish their central limit theorems. An information criterion, combining in-group similarities and the number of groups, is then proposed for clustering. I prove its theoretical consistency and propose a splitting-coalescence algorithm to reduce the computational burden in practice. The method is illustrated by both simulation and a real-data example. Journal: Journal of the American Statistical Association Pages: 577-588 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2012.760458 File-URL: http://hdl.handle.net/10.1080/01621459.2012.760458 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:577-588 Template-Type: ReDIF-Article 1.0 Author-Name: Faming Liang Author-X-Name-First: Faming Author-X-Name-Last: Liang Author-Name: Qifan Song Author-X-Name-First: Qifan Author-X-Name-Last: Song Author-Name: Kai Yu Author-X-Name-First: Kai Author-X-Name-Last: Yu Title: Bayesian Subset Modeling for High-Dimensional Generalized Linear Models Abstract: This article presents a new prior setting for high-dimensional generalized linear models, which leads to a Bayesian subset regression (BSR) with the maximum a posteriori model approximately equivalent to the minimum extended Bayesian information criterion model. The consistency of the resulting posterior is established under mild conditions.
Further, a variable screening procedure is proposed based on the marginal inclusion probability, which shares the same properties of sure screening and consistency with the existing sure independence screening (SIS) and iterative sure independence screening (ISIS) procedures. However, since the proposed procedure makes use of joint information from all predictors, it generally outperforms SIS and ISIS in real applications. This article also makes extensive comparisons of BSR with the popular penalized likelihood methods, including Lasso, elastic net, SIS, and ISIS. The numerical results indicate that BSR can generally outperform the penalized likelihood methods. The models selected by BSR tend to be sparser and, more importantly, of higher prediction ability. In addition, the performance of the penalized likelihood methods tends to deteriorate as the number of predictors increases, while this deterioration is not significant for BSR. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 589-606 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2012.761942 File-URL: http://hdl.handle.net/10.1080/01621459.2012.761942 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:589-606 Template-Type: ReDIF-Article 1.0 Author-Name: J. T. Gene Hwang Author-X-Name-First: J. T. Gene Author-X-Name-Last: Hwang Author-Name: Zhigen Zhao Author-X-Name-First: Zhigen Author-X-Name-Last: Zhao Title: Empirical Bayes Confidence Intervals for Selected Parameters in High-Dimensional Data Abstract: Modern statistical problems often involve a large number of populations and hence a large number of parameters that characterize these populations. It is common for scientists to use data to select the most significant populations, such as those with the largest t statistics. The scientific interest often lies in studying and making inferences regarding these parameters, called the selected parameters, corresponding to the selected populations. The current statistical practices either apply a traditional procedure assuming there was no selection (a practice that is not valid) or they use a Bonferroni-type procedure that is valid but very conservative and often noninformative. In this article, we propose valid and sharp confidence intervals that allow scientists to select parameters and to make inferences for the selected parameters based on the same data. This type of confidence interval allows users to zero in on the most interesting selected parameters without collecting more data. The validity of confidence intervals is defined as the controlling of Bayes coverage probability so that it is no less than a nominal level uniformly over a class of prior distributions for the parameter. When a mixed model is assumed and the random effects are the key parameters, this validity criterion is exactly the frequentist criterion, since the Bayes coverage probability is identical to the frequentist coverage probability. Assuming that the observations are normally distributed with unequal and unknown variances, we select parameters with the largest t statistics. We then construct sharp empirical Bayes confidence intervals for these selected parameters, which have either a large Bayes coverage probability or a small Bayes false coverage rate uniformly for a class of priors.
Our intervals, applicable to any high-dimensional data, are applied to microarray data and are shown to be better than all the alternatives. It is also anticipated that the same intervals would be valid for any selection rule. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 607-618 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.771102 File-URL: http://hdl.handle.net/10.1080/01621459.2013.771102 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:607-618 Template-Type: ReDIF-Article 1.0 Author-Name: Rong Liu Author-X-Name-First: Rong Author-X-Name-Last: Liu Author-Name: Lijian Yang Author-X-Name-First: Lijian Author-X-Name-Last: Yang Author-Name: Wolfgang K. Härdle Author-X-Name-First: Wolfgang K. Author-X-Name-Last: Härdle Title: Oracally Efficient Two-Step Estimation of Generalized Additive Model Abstract: The generalized additive model (GAM) is a multivariate nonparametric regression tool for non-Gaussian responses including binary and count data. We propose a spline-backfitted kernel (SBK) estimator for the component functions and the constant, which are oracally efficient under weak dependence. The SBK technique is both computationally expedient and theoretically reliable, thus usable for analyzing high-dimensional time series. Inference can be made on component functions based on asymptotic normality. Simulation evidence strongly corroborates the asymptotic theory. The method is applied to estimate the probability of insolvency and obtains a higher accuracy ratio than a previous study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 619-631 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.763726 File-URL: http://hdl.handle.net/10.1080/01621459.2013.763726 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:619-631 Template-Type: ReDIF-Article 1.0 Author-Name: Xueqin Wang Author-X-Name-First: Xueqin Author-X-Name-Last: Wang Author-Name: Yunlu Jiang Author-X-Name-First: Yunlu Author-X-Name-Last: Jiang Author-Name: Mian Huang Author-X-Name-First: Mian Author-X-Name-Last: Huang Author-Name: Heping Zhang Author-X-Name-First: Heping Author-X-Name-Last: Zhang Title: Robust Variable Selection With Exponential Squared Loss Abstract: Robust variable selection procedures through penalized regression have been gaining increased attention in the literature. They can be used to perform variable selection and are expected to yield robust estimates. However, to the best of our knowledge, the robustness of those penalized regression procedures has not been well characterized. In this article, we propose a class of penalized robust regression estimators based on exponential squared loss. The motivation for this new procedure is that it enables us to characterize its robustness in a way that has not been done for the existing procedures, while its performance is near optimal and superior to some recently developed methods. Specifically, under defined regularity conditions, our estimators are √n-consistent and possess the oracle property.
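For orientation, the exponential squared loss on which this class of estimators is built is commonly written as below, with $\gamma_n$ a tuning parameter governing the trade-off between efficiency and robustness; this is a standard form included for reference, and the paper's exact parameterization may differ.

\[
\phi_{\gamma_n}(t) \;=\; 1 - \exp\!\bigl(-t^2/\gamma_n\bigr),
\]

which behaves like the squared-error loss $t^2/\gamma_n$ for small residuals but contributes at most 1 for large residuals, consistent with the bounded influence described next.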
Importantly, we show that our estimators can achieve the highest asymptotic breakdown point of 1/2 and that their influence functions are bounded with respect to the outliers in either the response or the covariate domain. We performed simulation studies to compare our proposed method with some recent methods, using the oracle method as the benchmark, and considered common sources of influential points. Our simulation studies reveal that our proposed method performs similarly to the oracle method in terms of the model error and the positive selection rate even in the presence of influential points. In contrast, other existing procedures have a much higher noncausal selection rate. Furthermore, we reanalyze the Boston Housing Price Dataset and the Plasma Beta-Carotene Level Dataset, which are commonly used examples for regression diagnostics of influential points. Our analysis reveals the discrepancies between using our robust method and the other penalized regression methods, underscoring the importance of developing and applying robust penalized regression methods. Journal: Journal of the American Statistical Association Pages: 632-643 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.766613 File-URL: http://hdl.handle.net/10.1080/01621459.2013.766613 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:632-643 Template-Type: ReDIF-Article 1.0 Author-Name: Howard D. Bondell Author-X-Name-First: Howard D. Author-X-Name-Last: Bondell Author-Name: Leonard A. Stefanski Author-X-Name-First: Leonard A. Author-X-Name-Last: Stefanski Title: Efficient Robust Regression via Two-Stage Generalized Empirical Likelihood Abstract: Large- and finite-sample efficiency and resistance to outliers are the key goals of robust statistics. Although these goals are often not simultaneously attainable, we develop and study a linear regression estimator that comes close. Efficiency is obtained from the estimator's close connection to generalized empirical likelihood, and its favorable robustness properties are obtained by constraining the associated sum of (weighted) squared residuals. We prove maximum attainable finite-sample replacement breakdown point and full asymptotic efficiency for normal errors. Simulation evidence shows that compared to existing robust regression estimators, the new estimator has relatively high efficiency for small sample sizes and comparable outlier resistance. The estimator is further illustrated and compared to existing methods via application to a real dataset with purported outliers. Journal: Journal of the American Statistical Association Pages: 644-655 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.779847 File-URL: http://hdl.handle.net/10.1080/01621459.2013.779847 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:644-655 Template-Type: ReDIF-Article 1.0 Author-Name: Bo Li Author-X-Name-First: Bo Author-X-Name-Last: Li Author-Name: Marc G. Genton Author-X-Name-First: Marc G. Author-X-Name-Last: Genton Title: Nonparametric Identification of Copula Structures Abstract: We propose a unified framework for testing a variety of assumptions commonly made about the structure of copulas, including symmetry, radial symmetry, joint symmetry, associativity and Archimedeanity, and max-stability. Our test is nonparametric and based on the asymptotic distribution of the empirical copula process.
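For reference, the empirical copula underlying such a process is the standard rank-based estimator: for a sample $X_1,\ldots,X_n$ in $\mathbb{R}^d$,

\[
C_n(\mathbf{u}) \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\bigl\{\hat U_{i1} \le u_1, \ldots, \hat U_{id} \le u_d\bigr\},
\qquad \hat U_{ij} = \frac{R_{ij}}{n+1},
\]

where $R_{ij}$ is the rank of $X_{ij}$ among $X_{1j},\ldots,X_{nj}$, and the empirical copula process is $\sqrt{n}\,(C_n - C)$. This is the textbook construction; the article's test statistics are functionals of this process, and details such as the rescaling of ranks may differ there.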
We perform simulation experiments to evaluate our test and conclude that our method is reliable and powerful for assessing common assumptions on the structure of copulas, particularly when the sample size is moderately large. We illustrate our testing approach on two datasets. Journal: Journal of the American Statistical Association Pages: 666-675 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.787083 File-URL: http://hdl.handle.net/10.1080/01621459.2013.787083 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:666-675 Template-Type: ReDIF-Article 1.0 Author-Name: Hohsuk Noh Author-X-Name-First: Hohsuk Author-X-Name-Last: Noh Author-Name: Anouar El Ghouch Author-X-Name-First: Anouar El Author-X-Name-Last: Ghouch Author-Name: Taoufik Bouezmarni Author-X-Name-First: Taoufik Author-X-Name-Last: Bouezmarni Title: Copula-Based Regression Estimation and Inference Abstract: We investigate a new approach to estimating a regression function based on copulas. The main idea behind this approach is to write the regression function in terms of a copula and marginal distributions. Once the copula and the marginal distributions are estimated, we use the plug-in method to construct our new estimator. Because various methods are available in the literature for estimating both a copula and a distribution, this idea provides a rich and flexible family of regression estimators. We provide some asymptotic results related to this copula-based regression modeling when the copula is estimated via profile likelihood and the marginals are estimated nonparametrically. We also study the finite sample performance of the estimator and illustrate its usefulness by analyzing data from air pollution studies. Journal: Journal of the American Statistical Association Pages: 676-688 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.783842 File-URL: http://hdl.handle.net/10.1080/01621459.2013.783842 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:676-688 Template-Type: ReDIF-Article 1.0 Author-Name: Dong Hwan Oh Author-X-Name-First: Dong Hwan Author-X-Name-Last: Oh Author-Name: Andrew J. Patton Author-X-Name-First: Andrew J. Author-X-Name-Last: Patton Title: Simulated Method of Moments Estimation for Copula-Based Multivariate Models Abstract: This article considers the estimation of the parameters of a copula via a simulated method of moments (MM) type approach. This approach is attractive when the likelihood of the copula model is not known in closed form, or when the researcher has a set of dependence measures or other functionals of the copula that are of particular interest. The proposed approach naturally also nests MM and generalized method of moments estimators. Drawing on results for simulation-based estimation and on recent work in empirical copula process theory, we show the consistency and asymptotic normality of the proposed estimator, and obtain a simple test of overidentifying restrictions as a specification test. The results apply to both iid and time series data. We analyze the finite-sample behavior of these estimators in an extensive simulation study. We apply the model to a group of seven financial stock returns and find evidence of statistically significant tail dependence, and mild evidence that the dependence between these assets is stronger in crashes than booms. 
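To make the moment-matching idea concrete, the sketch below fits a one-parameter copula by matching a simulated Kendall's tau to its sample value. The Clayton family, the sample sizes, and the single-moment objective are illustrative assumptions, not the authors' estimator or weighting; for Clayton, tau even has the closed form $\tau = \theta/(\theta+2)$, which makes it a convenient check.

```python
import numpy as np
from scipy.stats import kendalltau
from scipy.optimize import minimize_scalar

def clayton_sample(theta, n, rng):
    # Marshall-Olkin construction: with V ~ Gamma(1/theta) and
    # independent unit exponentials E, U = (1 + E/V)^(-1/theta)
    # has a bivariate Clayton(theta) copula.
    v = rng.gamma(1.0 / theta, size=(n, 1))
    e = rng.exponential(size=(n, 2))
    return (1.0 + e / v) ** (-1.0 / theta)

# "Observed" sample: pseudo-data from a Clayton copula with theta = 2.
u_obs = clayton_sample(2.0, 500, np.random.default_rng(0))
tau_hat, _ = kendalltau(u_obs[:, 0], u_obs[:, 1])

def smm_objective(theta):
    # Squared distance between the sample Kendall's tau and the tau of
    # a large simulated sample at the candidate theta. Reusing the same
    # seed (common random numbers) keeps the objective smooth in theta.
    sim = clayton_sample(theta, 20000, np.random.default_rng(1))
    tau_sim, _ = kendalltau(sim[:, 0], sim[:, 1])
    return (tau_hat - tau_sim) ** 2

theta_smm = minimize_scalar(smm_objective, bounds=(0.1, 10.0),
                            method="bounded").x
print(theta_smm)  # should land near 2; check: tau = theta/(theta + 2)
```

Holding the simulation seed fixed inside the objective is the usual trick in simulation-based estimation: without common random numbers the objective would be noisy in theta and the optimizer could stall.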
Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 689-700 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.785952 File-URL: http://hdl.handle.net/10.1080/01621459.2013.785952 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:689-700 Template-Type: ReDIF-Article 1.0 Author-Name: Chunpeng Fan Author-X-Name-First: Chunpeng Author-X-Name-Last: Fan Author-Name: Jason P. Fine Author-X-Name-First: Jason P. Author-X-Name-Last: Fine Title: Linear Transformation Model With Parametric Covariate Transformations Abstract: The traditional linear transformation model assumes a linear relationship between the transformed response and the covariates. However, in real data, this linear relationship may be violated. We propose a linear transformation model that allows parametric covariate transformations to recover the linearity. Although the proposed generalization may seem rather simple, the inferential issues are quite challenging due to loss of identifiability under the null of no effects of transformed covariates. This article develops tests for such hypotheses. We establish rigorous inferences for parameters and the unspecified transformation function when the transformed covariates have nonzero effects. The estimates and tests perform well in simulation studies using a realistic sample size. We also develop goodness-of-fit tests for the transformation and an R-squared measure for model comparison. GAGurine data are used to illustrate the practical utility of the proposed methods. Journal: Journal of the American Statistical Association Pages: 701-712 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.770707 File-URL: http://hdl.handle.net/10.1080/01621459.2013.770707 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:701-712 Template-Type: ReDIF-Article 1.0 Author-Name: Yunzhang Zhu Author-X-Name-First: Yunzhang Author-X-Name-Last: Zhu Author-Name: Xiaotong Shen Author-X-Name-First: Xiaotong Author-X-Name-Last: Shen Author-Name: Wei Pan Author-X-Name-First: Wei Author-X-Name-Last: Pan Title: Simultaneous Grouping Pursuit and Feature Selection Over an Undirected Graph Abstract: In high-dimensional regression, grouping pursuit and feature selection have their own merits while complementing each other in battling the curse of dimensionality. To seek a parsimonious model, we perform simultaneous grouping pursuit and feature selection over an arbitrary undirected graph with each node corresponding to one predictor. When the corresponding nodes are reachable from each other over the graph, regression coefficients whose absolute values are the same or close can be grouped. This is motivated by gene network analysis, where genes tend to work in groups according to their biological functionalities. Through a nonconvex penalty, we develop a computational strategy and analyze the proposed method. Theoretical analysis indicates that the proposed method reconstructs the oracle estimator, that is, the unbiased least-square estimator given the true grouping, leading to consistent reconstruction of grouping structures and informative features, as well as to optimal parameter estimation.
Simulation studies suggest that the method combines the benefit of grouping pursuit with that of feature selection, and compares favorably against its competitors in selection accuracy and predictive performance. An application to eQTL data is used to illustrate the methodology, where a network is incorporated into analysis through an undirected graph. Journal: Journal of the American Statistical Association Pages: 713-725 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.770704 File-URL: http://hdl.handle.net/10.1080/01621459.2013.770704 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:713-725 Template-Type: ReDIF-Article 1.0 Author-Name: Zhou Zhou Author-X-Name-First: Zhou Author-X-Name-Last: Zhou Title: Heteroscedasticity and Autocorrelation Robust Structural Change Detection Abstract: The assumption of (weak) stationarity is crucial for the validity of most of the conventional tests of structure change in time series. Under complicated nonstationary temporal dynamics, we argue that traditional testing procedures result in mixed structural change signals of the first and second order and hence could lead to biased testing results. The article proposes a simple and unified bootstrap testing procedure that provides consistent testing results under general forms of smooth and abrupt changes in the temporal dynamics of the time series. Monte Carlo experiments are performed to compare our testing procedure with various traditional tests. Our robust bootstrap test is applied to testing changes in an environmental and a financial time series and our procedure is shown to provide more reliable results than the conventional tests. Journal: Journal of the American Statistical Association Pages: 726-740 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.787184 File-URL: http://hdl.handle.net/10.1080/01621459.2013.787184 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:726-740 Template-Type: ReDIF-Article 1.0 Author-Name: Lawrence D. Brown Author-X-Name-First: Lawrence D. Author-X-Name-Last: Brown Author-Name: Eitan Greenshtein Author-X-Name-First: Eitan Author-X-Name-Last: Greenshtein Author-Name: Ya'acov Ritov Author-X-Name-First: Ya'acov Author-X-Name-Last: Ritov Title: The Poisson Compound Decision Problem Revisited Abstract: The compound decision problem for a vector of independent Poisson random variables with possibly different means has a half-century-old solution. However, it appears that the classical solution needs smoothing adjustment. We discuss three such adjustments. We also present another approach that first transforms the problem into the normal compound decision problem. A simulation study shows the effectiveness of the procedures in improving the performance over that of the classical procedure. A real data example is also provided. The procedures depend on a smoothness parameter that can be selected using a nonstandard cross-validation step, which is of independent interest. Finally, we mention some asymptotic results. Journal: Journal of the American Statistical Association Pages: 741-749 Issue: 502 Volume: 108 Year: 2013 Month: 6 X-DOI: 10.1080/01621459.2013.771582 File-URL: http://hdl.handle.net/10.1080/01621459.2013.771582 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:741-749 Template-Type: ReDIF-Article 1.0 Author-Name: Matt Taddy Author-X-Name-First: Matt Author-X-Name-Last: Taddy Title: Multinomial Inverse Regression for Text Analysis Abstract: Text data, including speeches, stories, and other document forms, are often connected to sentiment variables that are of interest for research in marketing, economics, and elsewhere. They are also very high dimensional and difficult to incorporate into statistical analyses. This article introduces a straightforward framework of sentiment-sufficient dimension reduction for text data. Multinomial inverse regression is introduced as a general tool for simplifying predictor sets that can be represented as draws from a multinomial distribution, and we show that logistic regression of phrase counts onto document annotations can be used to obtain low-dimensional document representations that are rich in sentiment information. To facilitate this modeling, a novel estimation technique is developed for multinomial logistic regression with a very high-dimensional response. In particular, independent Laplace priors with unknown variance are assigned to each regression coefficient, and we detail an efficient routine for maximization of the joint posterior over coefficients and their prior scale. This "gamma-lasso" scheme yields stable and effective estimation for general high-dimensional logistic regression, and we argue that it will be superior to current methods in many settings. Guidelines for prior specification are provided, algorithm convergence is detailed, and estimator properties are outlined from the perspective of the literature on nonconcave likelihood penalization. Related work on sentiment analysis from statistics, econometrics, and machine learning is surveyed and connected. Finally, the methods are applied in two detailed examples, and we provide out-of-sample prediction studies to illustrate their effectiveness. Journal: Journal of the American Statistical Association Pages: 755-770 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2012.734168 File-URL: http://hdl.handle.net/10.1080/01621459.2012.734168 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:755-770 Template-Type: ReDIF-Article 1.0 Author-Name: Justin Grimmer Author-X-Name-First: Justin Author-X-Name-Last: Grimmer Title: Comment Journal: Journal of the American Statistical Association Pages: 770-771 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.822383 File-URL: http://hdl.handle.net/10.1080/01621459.2013.822383 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:770-771 Template-Type: ReDIF-Article 1.0 Author-Name: David M. Blei Author-X-Name-First: David M. Author-X-Name-Last: Blei Title: Comment Journal: Journal of the American Statistical Association Pages: 771-772 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.827983 File-URL: http://hdl.handle.net/10.1080/01621459.2013.827983 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:771-772 Template-Type: ReDIF-Article 1.0 Author-Name: Matt Taddy Author-X-Name-First: Matt Author-X-Name-Last: Taddy Title: Rejoinder: Efficiency and Structure in MNIR Journal: Journal of the American Statistical Association Pages: 772-774 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.821408 File-URL: http://hdl.handle.net/10.1080/01621459.2013.821408 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:772-774 Template-Type: ReDIF-Article 1.0 Author-Name: Juhee Lee Author-X-Name-First: Juhee Author-X-Name-Last: Lee Author-Name: Peter Müller Author-X-Name-First: Peter Author-X-Name-Last: Müller Author-Name: Yitan Zhu Author-X-Name-First: Yitan Author-X-Name-Last: Zhu Author-Name: Yuan Ji Author-X-Name-First: Yuan Author-X-Name-Last: Ji Title: A Nonparametric Bayesian Model for Local Clustering With Application to Proteomics Abstract: We propose a nonparametric Bayesian local clustering (NoB-LoC) approach for heterogeneous data. NoB-LoC implements inference for nested clusters as posterior inference under a Bayesian model. Using protein expression data as an example, the NoB-LoC model defines a protein (column) cluster as a set of proteins that give rise to the same partition of the samples (rows). In other words, the sample partitions are nested within protein clusters. The common clustering of the samples gives meaning to the protein clusters. Any pair of samples might belong to the same cluster for one protein set but to different clusters for another protein set. These local features are different from features obtained by global clustering approaches such as hierarchical clustering, which create only one partition of samples that applies to all the proteins in the dataset. In addition, the NoB-LoC model is different from most other local or nested clustering methods, which define clusters based on common parameters in the sampling model. As an added and important feature, the NoB-LoC method probabilistically excludes sets of irrelevant proteins and samples that do not meaningfully cocluster with other proteins and samples, thus improving the inference on the clustering of the remaining proteins and samples. Inference is guided by a joint probability model for all the random elements. We provide a simulation study and a motivating example to demonstrate the unique features of the NoB-LoC model. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 775-788 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.784705 File-URL: http://hdl.handle.net/10.1080/01621459.2013.784705 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:775-788 Template-Type: ReDIF-Article 1.0 Author-Name: Peter B. Gilbert Author-X-Name-First: Peter B. Author-X-Name-Last: Gilbert Author-Name: Bryan E. Shepherd Author-X-Name-First: Bryan E. Author-X-Name-Last: Shepherd Author-Name: Michael G. Hudgens Author-X-Name-First: Michael G. Author-X-Name-Last: Hudgens Title: Sensitivity Analysis of Per-Protocol Time-to-Event Treatment Efficacy in Randomized Clinical Trials Abstract: Assessing per-protocol (PP) treatment efficacy on a time-to-event endpoint is a common objective of randomized clinical trials.
The typical analysis uses the same method employed for the intention-to-treat analysis (e.g., standard survival analysis) applied to the subgroup meeting protocol adherence criteria. However, due to potential post-randomization selection bias, this analysis may yield misleading conclusions about treatment efficacy. Moreover, while there is extensive literature on methods for assessing causal treatment effects in compliers, these methods do not apply to a common class of trials where (a) the primary objective compares survival curves, (b) it is inconceivable to assign participants to be adherent and event free before adherence is measured, and (c) the exclusion restriction assumption fails to hold. HIV vaccine efficacy trials, including the recent RV144 trial, exemplify this class, because many primary endpoints (e.g., HIV infections) occur before adherence is measured, and nonadherent subjects who receive some of the planned immunizations may be partially protected. Therefore, we develop methods for assessing PP treatment efficacy for this problem class, considering three causal estimands of interest. Because these estimands are not identifiable from the observable data, we develop nonparametric bounds and semiparametric sensitivity analysis methods that yield estimated ignorance and uncertainty intervals. The methods are applied to RV144. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 789-800 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.786649 File-URL: http://hdl.handle.net/10.1080/01621459.2013.786649 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:789-800 Template-Type: ReDIF-Article 1.0 Author-Name: James Raymer Author-X-Name-First: James Author-X-Name-Last: Raymer Author-Name: Arkadiusz Wiśniowski Author-X-Name-First: Arkadiusz Author-X-Name-Last: Wiśniowski Author-Name: Jonathan J. Forster Author-X-Name-First: Jonathan J. Author-X-Name-Last: Forster Author-Name: Peter W. F. Smith Author-X-Name-First: Peter W. F. Author-X-Name-Last: Smith Author-Name: Jakub Bijak Author-X-Name-First: Jakub Author-X-Name-Last: Bijak Title: Integrated Modeling of European Migration Abstract: International migration data in Europe are collected by individual countries with separate collection systems and designs. As a result, reported data are inconsistent in availability, definition, and quality. In this article, we propose a Bayesian model to overcome the limitations of the various data sources. The focus is on estimating recent international migration flows among 31 countries in the European Union and European Free Trade Association from 2002 to 2008, using data collated by Eurostat. We also incorporate covariate information and information provided by experts on the effects of undercount, measurement, and accuracy of data collection systems. The methodology is integrated and produces a synthetic database with measures of uncertainty for international migration flows and other model parameters. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 801-819 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.789435 File-URL: http://hdl.handle.net/10.1080/01621459.2013.789435 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:801-819 Template-Type: ReDIF-Article 1.0 Author-Name: Brian J. Reich Author-X-Name-First: Brian J. Author-X-Name-Last: Reich Author-Name: Dipankar Bandyopadhyay Author-X-Name-First: Dipankar Author-X-Name-Last: Bandyopadhyay Author-Name: Howard D. Bondell Author-X-Name-First: Howard D. Author-X-Name-Last: Bondell Title: A Nonparametric Spatial Model for Periodontal Data With Nonrandom Missingness Abstract: Periodontal disease (PD) progression is often quantified by clinical attachment level (CAL), defined as the distance down a tooth's root that is detached from the surrounding bone. Measured at six locations per tooth throughout the mouth (excluding the molars), it gives rise to a dependent data setup. These data are often reduced to a one-number summary, such as the whole-mouth average or the number of observations greater than a threshold, to be used as the response in a regression to identify important covariates related to the current state of a subject's periodontal health. Rather than a simple one-number summary, we set out to analyze all available CAL data for each subject, exploiting the presence of spatial dependence, nonstationarity, and nonnormality. Also, many subjects have a considerable proportion of missing teeth, which cannot be considered missing at random because PD is the leading cause of adult tooth loss. Under a Bayesian paradigm, we propose a nonparametric flexible spatial (joint) model of observed CAL and the locations of missing teeth via kernel convolution methods, incorporating the aforementioned features of CAL data under a unified framework. Application of this methodology to a dataset recording the periodontal health of an African-American population, as well as simulation studies, reveals the gain in model fit and inference and provides a new perspective on unraveling covariate--response relationships in the presence of complexities posed by these data. Journal: Journal of the American Statistical Association Pages: 820-831 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.795487 File-URL: http://hdl.handle.net/10.1080/01621459.2013.795487 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:820-831 Template-Type: ReDIF-Article 1.0 Author-Name: Jason L. Morrissette Author-X-Name-First: Jason L. Author-X-Name-Last: Morrissette Author-Name: Michael P. McDermott Author-X-Name-First: Michael P. Author-X-Name-Last: McDermott Title: Estimation and Inference Concerning Ordered Means in Analysis of Covariance Models With Interactions Abstract: When interactions are identified in analysis of covariance models, it becomes important to identify values of the covariates for which there are significant differences or, more generally, significant contrasts among the group mean responses. Inferential procedures that incorporate a priori order restrictions among the group mean responses would be expected to be superior to those that ignore this information. In this article, we focus on analysis of covariance models with prespecified order restrictions on the mean response across the levels of a grouping variable when the grouping variable may interact with model covariates.
For the restrictions to hold in the presence of interactions, they must be imposed over all levels of interacting categorical covariates and across prespecified ranges of interacting continuous covariates. The parameter estimation procedure involves solving a quadratic programming minimization problem with a carefully specified constraint matrix. Simultaneous confidence intervals for treatment group contrasts and tests for equality of the ordered group mean responses are determined by exploiting previously unconnected literature. The proposed methods are motivated by a clinical trial of the dopamine agonist pramipexole for the treatment of early-stage Parkinson's disease. Journal: Journal of the American Statistical Association Pages: 832-839 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.797355 File-URL: http://hdl.handle.net/10.1080/01621459.2013.797355 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:832-839 Template-Type: ReDIF-Article 1.0 Author-Name: Roland Langrock Author-X-Name-First: Roland Author-X-Name-Last: Langrock Author-Name: David L. Borchers Author-X-Name-First: David L. Author-X-Name-Last: Borchers Author-Name: Hans J. Skaug Author-X-Name-First: Hans J. Author-X-Name-Last: Skaug Title: Markov-Modulated Nonhomogeneous Poisson Processes for Modeling Detections in Surveys of Marine Mammal Abundance Abstract: We consider Markov-modulated nonhomogeneous Poisson processes for modeling sightings of marine mammals in shipboard or aerial surveys. In such surveys, detection of an animal is possible only when it surfaces, and with some species a substantial proportion of animals is missed because they are diving and thus not available for detection. This needs to be adequately accounted for to avoid biased abundance estimates. The tendency of surfacing events of marine mammals to occur in clusters motivates consideration of the flexible class of Markov-modulated Poisson processes in this context. We embed these models in distance sampling models, introducing nonhomogeneity in the process to account for the fact that the observer's probability of detecting an animal decreases with increasing distance to the animal. We derive approximate expressions for the likelihood of Markov-modulated nonhomogeneous Poisson processes that enable us to estimate the model parameters through numerical maximum likelihood. The performance of the approach is investigated in an extensive simulation study, and applications to pilot and beaked whale tag data as well as to minke whale tag and survey data demonstrate its relevance in abundance estimation. Journal: Journal of the American Statistical Association Pages: 840-851 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.797356 File-URL: http://hdl.handle.net/10.1080/01621459.2013.797356 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:840-851 Template-Type: ReDIF-Article 1.0 Author-Name: Jonathan Rougier Author-X-Name-First: Jonathan Author-X-Name-Last: Rougier Author-Name: Michael Goldstein Author-X-Name-First: Michael Author-X-Name-Last: Goldstein Author-Name: Leanna House Author-X-Name-First: Leanna Author-X-Name-Last: House Title: Second-Order Exchangeability Analysis for Multimodel Ensembles Abstract: The challenge of understanding complex systems often gives rise to a multiplicity of models. It is natural to consider whether the outputs of these models can be combined to produce a system prediction that is more informative than the output of any one of the models taken in isolation and, in particular, to consider the relationship between the spread of model outputs and system uncertainty. We describe a statistical framework for such a combination, based on the exchangeability of the models, and their coexchangeability with the system. We demonstrate the simplest implementation of our framework in the context of climate prediction. Throughout, we work entirely in means and variances to avoid the necessity of specifying higher-order quantities for which we often lack well-founded judgments. Journal: Journal of the American Statistical Association Pages: 852-863 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.802963 File-URL: http://hdl.handle.net/10.1080/01621459.2013.802963 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:852-863 Template-Type: ReDIF-Article 1.0 Author-Name: Tao Yu Author-X-Name-First: Tao Author-X-Name-Last: Yu Author-Name: Pengfei Li Author-X-Name-First: Pengfei Author-X-Name-Last: Li Title: Spatial Shrinkage Estimation of Diffusion Tensors on Diffusion-Weighted Imaging Data Abstract: Diffusion tensor imaging (DTI), based on the diffusion-weighted imaging (DWI) data acquired from magnetic resonance experiments, has been widely used to analyze the physical structure of white-matter fibers in the human brain in vivo. The raw DWI data, however, carry noise; this contaminates the diffusion tensor (DT) estimates and introduces systematic bias into the induced eigenvalues. These bias components affect the effectiveness of fiber-tracking algorithms. In this article, we propose a two-stage spatial shrinkage estimation (SpSkE) procedure to accommodate the spatial information carried in DWI data in DT estimation and to reduce the bias components in the corresponding derived eigenvalues. To this end, in the framework of the heteroscedastic linear model, SpSkE incorporates L1-type penalization and the locally weighted least-squares function. The theoretical properties of SpSkE are explored. The effectiveness of SpSkE is further illustrated by simulation and real-data examples. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 864-875 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.804408 File-URL: http://hdl.handle.net/10.1080/01621459.2013.804408 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:864-875 Template-Type: ReDIF-Article 1.0 Author-Name: Francesco C. Stingo Author-X-Name-First: Francesco C.
Author-X-Name-Last: Stingo Author-Name: Michele Guindani Author-X-Name-First: Michele Author-X-Name-Last: Guindani Author-Name: Marina Vannucci Author-X-Name-First: Marina Author-X-Name-Last: Vannucci Author-Name: Vince D. Calhoun Author-X-Name-First: Vince D. Author-X-Name-Last: Calhoun Title: An Integrative Bayesian Modeling Approach to Imaging Genetics Abstract: In this article, we present a Bayesian hierarchical modeling approach for imaging genetics, where the interest lies in linking brain connectivity across multiple individuals to their genetic information. We have available data from a functional magnetic resonance imaging (fMRI) study on schizophrenia. Our goals are to identify brain regions of interest (ROIs) with discriminating activation patterns between schizophrenic patients and healthy controls, and to relate the ROIs' activations with available genetic information from single nucleotide polymorphisms (SNPs) on the subjects. For this task, we develop a hierarchical mixture model that includes several innovative characteristics: it incorporates the selection of ROIs that discriminate the subjects into separate groups; it allows the mixture components to depend on selected covariates; it includes prior models that capture structural dependencies among the ROIs. Applied to the schizophrenia dataset, the model leads to the simultaneous selection of a set of discriminatory ROIs and the relevant SNPs, together with the reconstruction of the correlation structure of the selected regions. To the best of our knowledge, our work represents the first attempt at a rigorous modeling strategy for imaging genetics data that incorporates all such features. Journal: Journal of the American Statistical Association Pages: 876-891 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.804409 File-URL: http://hdl.handle.net/10.1080/01621459.2013.804409 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:876-891 Template-Type: ReDIF-Article 1.0 Author-Name: Jin Zhang Author-X-Name-First: Jin Author-X-Name-Last: Zhang Author-Name: Thomas M. Braun Author-X-Name-First: Thomas M. Author-X-Name-Last: Braun Title: A Phase I Bayesian Adaptive Design to Simultaneously Optimize Dose and Schedule Assignments Both Between and Within Patients Abstract: In traditional schedule or dose--schedule finding designs, patients are assumed to receive their assigned dose--schedule combination throughout the trial even though the combination may be found to have an undesirable toxicity profile, which contradicts actual clinical practice. Since no systematic approach exists to optimize intrapatient dose--schedule assignment, we propose a Phase I clinical trial design that extends existing approaches to optimize dose and schedule solely between patients by incorporating adaptive variations to dose--schedule assignments within patients as the study proceeds. Our design is based on a Bayesian nonmixture cure rate model that incorporates the multiple administrations each patient receives, with the per-administration dose included as a covariate. Simulations demonstrate that our design identifies safe dose and schedule combinations as well as the traditional method, which does not allow for intrapatient dose--schedule reassignments, but does so with a larger number of patients assigned to safe combinations. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 892-901 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.806927 File-URL: http://hdl.handle.net/10.1080/01621459.2013.806927 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:892-901 Template-Type: ReDIF-Article 1.0 Author-Name: Jerry Q. Cheng Author-X-Name-First: Jerry Q. Author-X-Name-Last: Cheng Author-Name: Minge Xie Author-X-Name-First: Minge Author-X-Name-Last: Xie Author-Name: Rong Chen Author-X-Name-First: Rong Author-X-Name-Last: Chen Author-Name: Fred Roberts Author-X-Name-First: Fred Author-X-Name-Last: Roberts Title: A Latent Source Model to Detect Multiple Spatial Clusters With Application in a Mobile Sensor Network for Surveillance of Nuclear Materials Abstract: Potential nuclear attacks are among the most devastating terrorist attacks, with severe loss of human lives as well as damage to infrastructure. To deter such threats, it becomes increasingly vital to have sophisticated nuclear surveillance and detection systems deployed in major cities in the United States, such as New York City. In this article, we design a mobile sensor network and develop statistical algorithms and models to provide consistent and pervasive surveillance of nuclear materials in major cities. The network consists of a large number of vehicles on which nuclear sensors and Global Positioning System (GPS) tracking devices are installed. Real-time sensor readings and GPS information are transmitted to and processed at a central surveillance center. Mathematical and statistical analyses are performed, in which we mimic a signal-generating process and develop a latent source modeling framework to detect multiple spatial clusters. A Monte Carlo expectation-maximization algorithm is developed to estimate model parameters, detect significant clusters, and identify their locations and sizes. We also determine the number of clusters using a modified Akaike Information Criterion/Bayesian Information Criterion. Simulation studies to evaluate the effectiveness and detection power of such a network are described. Journal: Journal of the American Statistical Association Pages: 902-913 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.808945 File-URL: http://hdl.handle.net/10.1080/01621459.2013.808945 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:902-913 Template-Type: ReDIF-Article 1.0 Author-Name: Yichen Qin Author-X-Name-First: Yichen Author-X-Name-Last: Qin Author-Name: Carey E. Priebe Author-X-Name-First: Carey E. Author-X-Name-Last: Priebe Title: Maximum Lq-Likelihood Estimation via the Expectation-Maximization Algorithm: A Robust Estimation of Mixture Models Abstract: We introduce maximum Lq-likelihood estimation (MLqE) of mixture models using our proposed expectation-maximization (EM) algorithm, namely the EM algorithm with Lq-likelihood (EM-Lq). Properties of the MLqE obtained from the proposed EM-Lq are studied through simulated mixture model data. Compared with maximum likelihood estimation (MLE), which is obtained from the EM algorithm, the MLqE provides more robust estimation against outliers for small sample sizes.
In particular, we study the performance of the MLqE in the context of the gross error model, where the true model of interest is a mixture of two normal distributions, and the contamination component is a third normal distribution with a large variance. A numerical comparison between the MLqE and the MLE for this gross error model is presented in terms of Kullback--Leibler (KL) distance and relative efficiency. Journal: Journal of the American Statistical Association Pages: 914-928 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.787933 File-URL: http://hdl.handle.net/10.1080/01621459.2013.787933 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:914-928 Template-Type: ReDIF-Article 1.0 Author-Name: Mian Huang Author-X-Name-First: Mian Author-X-Name-Last: Huang Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Author-Name: Shaoli Wang Author-X-Name-First: Shaoli Author-X-Name-Last: Wang Title: Nonparametric Mixture of Regression Models Abstract: Motivated by an analysis of U.S. house price index (HPI) data, we propose a nonparametric finite mixture of regression models. We study the identifiability issue of the proposed models, and develop an estimation procedure by employing kernel regression. We further systematically study the sampling properties of the proposed estimators, and establish their asymptotic normality. A modified EM algorithm is proposed to carry out the estimation procedure. We show that our algorithm preserves the ascent property of the EM algorithm in an asymptotic sense. Monte Carlo simulations are conducted to examine the finite sample performance of the proposed estimation procedure. The proposed methodology is illustrated with an empirical analysis of the U.S. HPI data. Journal: Journal of the American Statistical Association Pages: 929-941 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.772897 File-URL: http://hdl.handle.net/10.1080/01621459.2013.772897 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:929-941 Template-Type: ReDIF-Article 1.0 Author-Name: Mahbubul Majumder Author-X-Name-First: Mahbubul Author-X-Name-Last: Majumder Author-Name: Heike Hofmann Author-X-Name-First: Heike Author-X-Name-Last: Hofmann Author-Name: Dianne Cook Author-X-Name-First: Dianne Author-X-Name-Last: Cook Title: Validation of Visual Statistical Inference, Applied to Linear Models Abstract: Statistical graphics play a crucial role in exploratory data analysis, model checking, and diagnosis. The lineup protocol enables statistical significance testing of visual findings, bridging the gulf between exploratory and inferential statistics. In this article, inferential methods for statistical graphics are developed further by refining the terminology of visual inference and framing the lineup protocol in a context that allows direct comparison with conventional tests in scenarios when a conventional test exists. This framework is used to compare the performance of the lineup protocol against conventional statistical testing in the scenario of fitting linear models. A human subjects experiment is conducted using simulated data to provide controlled conditions.
Results suggest that the lineup protocol performs comparably with the conventional tests and, as expected, outperforms them when data are contaminated, a scenario where assumptions required for performing a conventional test are violated. Surprisingly, visual tests have higher power than the conventional tests when the effect size is large. Interestingly, there may be some super-visual individuals who yield better performance and power than the conventional test even in the most difficult tasks. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 942-956 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.808157 File-URL: http://hdl.handle.net/10.1080/01621459.2013.808157 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:942-956 Template-Type: ReDIF-Article 1.0 Author-Name: Susan Wei Author-X-Name-First: Susan Author-X-Name-Last: Wei Author-Name: Michael R. Kosorok Author-X-Name-First: Michael R. Author-X-Name-Last: Kosorok Title: Latent Supervised Learning Abstract: This article introduces a new machine learning task, called latent supervised learning, where the goal is to learn a binary classifier from continuous training labels that serve as surrogates for the unobserved class labels. We investigate a specific model where the surrogate variable arises from a two-component Gaussian mixture with unknown means and variances, and the component membership is determined by a hyperplane in the covariate space. The estimation of the separating hyperplane and the Gaussian mixture parameters forms what shall be referred to as the change-line classification problem. We propose a data-driven sieve maximum likelihood estimator for the hyperplane, which in turn can be used to estimate the parameters of the Gaussian mixture. The estimator is shown to be consistent. Simulations as well as empirical data show that the estimator has high classification accuracy. Journal: Journal of the American Statistical Association Pages: 957-970 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.789695 File-URL: http://hdl.handle.net/10.1080/01621459.2013.789695 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:957-970 Template-Type: ReDIF-Article 1.0 Author-Name: Colin O. Wu Author-X-Name-First: Colin O. Author-X-Name-Last: Wu Author-Name: Xin Tian Author-X-Name-First: Xin Author-X-Name-Last: Tian Title: Nonparametric Estimation of Conditional Distributions and Rank-Tracking Probabilities With Time-Varying Transformation Models in Longitudinal Studies Abstract: An objective of longitudinal analysis is to estimate the conditional distributions of an outcome variable through a regression model. The approaches based on modeling the conditional means are not appropriate for this task when the conditional distributions are skewed or cannot be approximated by a normal distribution through a known transformation. We study a class of time-varying transformation models and a two-step smoothing method for the estimation of the conditional distribution functions. Based on our models, we propose a rank-tracking probability and a rank-tracking probability ratio to measure the strength of tracking ability of an outcome variable at two different time points.
Our models and estimation method can be applied to a wide range of scientific objectives that cannot be evaluated by the conditional mean-based models. We derive the asymptotic properties for the two-step local polynomial estimators of the conditional distribution functions. Finite sample properties of our procedures are investigated through a simulation study. Application of our models and estimation method is demonstrated through an epidemiological study of childhood growth and blood pressure. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 971-982 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.808949 File-URL: http://hdl.handle.net/10.1080/01621459.2013.808949 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:971-982 Template-Type: ReDIF-Article 1.0 Author-Name: Xiaoke Zhang Author-X-Name-First: Xiaoke Author-X-Name-Last: Zhang Author-Name: Byeong U. Park Author-X-Name-First: Byeong U. Author-X-Name-Last: Park Author-Name: Jane-Ling Wang Author-X-Name-First: Jane-Ling Author-X-Name-Last: Wang Title: Time-Varying Additive Models for Longitudinal Data Abstract: The additive model is an effective dimension-reduction approach that also provides flexibility in modeling the relation between a response variable and key covariates. The literature is largely developed for scalar responses and vector covariates. In this article, more complex data are of interest, where both the response and the covariates are functions. We propose a functional additive model together with a new backfitting algorithm to estimate the unknown regression functions, whose components are time-dependent additive functions of the covariates. Such functional data may not be completely observed since measurements may only be collected intermittently at discrete time points. We develop a unified platform and an efficient approach that can cover both dense and sparse functional data and the needed theory for statistical inference. We also establish the oracle properties of the proposed estimators of the component functions. Journal: Journal of the American Statistical Association Pages: 983-998 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.778776 File-URL: http://hdl.handle.net/10.1080/01621459.2013.778776 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:983-998 Template-Type: ReDIF-Article 1.0 Author-Name: P. Richard Hahn Author-X-Name-First: P. Richard Author-X-Name-Last: Hahn Author-Name: Carlos M. Carvalho Author-X-Name-First: Carlos M. Author-X-Name-Last: Carvalho Author-Name: Sayan Mukherjee Author-X-Name-First: Sayan Author-X-Name-Last: Mukherjee Title: Partial Factor Modeling: Predictor-Dependent Shrinkage for Linear Regression Abstract: We develop a modified Gaussian factor model for the purpose of inducing predictor-dependent shrinkage for linear regression. The new model predicts well across a wide range of covariance structures, on real and simulated data. Furthermore, the new model facilitates variable selection in the case of correlated predictor variables, which often stymies other methods.
Journal: Journal of the American Statistical Association Pages: 999-1008 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.779843 File-URL: http://hdl.handle.net/10.1080/01621459.2013.779843 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:999-1008 Template-Type: ReDIF-Article 1.0 Author-Name: Xiaolei Xun Author-X-Name-First: Xiaolei Author-X-Name-Last: Xun Author-Name: Jiguo Cao Author-X-Name-First: Jiguo Author-X-Name-Last: Cao Author-Name: Bani Mallick Author-X-Name-First: Bani Author-X-Name-Last: Mallick Author-Name: Arnab Maity Author-X-Name-First: Arnab Author-X-Name-Last: Maity Author-Name: Raymond J. Carroll Author-X-Name-First: Raymond J. Author-X-Name-Last: Carroll Title: Parameter Estimation of Partial Differential Equation Models Abstract: Partial differential equation (PDE) models are commonly used to model complex dynamic systems in applied sciences such as biology and finance. The forms of these PDE models are usually proposed by experts based on their prior knowledge and understanding of the dynamic system. Parameters in PDE models often have interesting scientific interpretations, but their values are often unknown and need to be estimated from the measurements of the dynamic system in the presence of measurement errors. Most PDEs used in practice have no analytic solutions, and can only be solved with numerical methods. Currently, methods for estimating PDE parameters require repeatedly solving PDEs numerically under thousands of candidate parameter values, and thus the computational load is high. In this article, we propose two methods to estimate parameters in PDE models: a parameter cascading method and a Bayesian approach. In both methods, the underlying dynamic process modeled with the PDE model is represented via basis function expansion. For the parameter cascading method, we develop two nested levels of optimization to estimate the PDE parameters. For the Bayesian method, we develop a joint model for the data and the PDE, together with a novel hierarchical model that allows us to employ Markov chain Monte Carlo (MCMC) techniques to make posterior inference. Simulation studies show that the Bayesian method and parameter cascading method are comparable, and both outperform other available methods in terms of estimation accuracy. The two methods are demonstrated by estimating parameters in a PDE model from long-range infrared light detection and ranging data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1009-1020 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.794730 File-URL: http://hdl.handle.net/10.1080/01621459.2013.794730 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:1009-1020 Template-Type: ReDIF-Article 1.0 Author-Name: Stéphane Guerrier Author-X-Name-First: Stéphane Author-X-Name-Last: Guerrier Author-Name: Jan Skaloud Author-X-Name-First: Jan Author-X-Name-Last: Skaloud Author-Name: Yannick Stebler Author-X-Name-First: Yannick Author-X-Name-Last: Stebler Author-Name: Maria-Pia Victoria-Feser Author-X-Name-First: Maria-Pia Author-X-Name-Last: Victoria-Feser Title: Wavelet-Variance-Based Estimation for Composite Stochastic Processes Abstract: This article presents a new estimation method for the parameters of a time series model.
We consider here composite Gaussian processes that are the sum of independent Gaussian processes, each of which explains an important aspect of the time series, as is the case in engineering and the natural sciences. The proposed estimation method offers an alternative to classical likelihood-based estimation that is straightforward to implement and is often the only feasible estimation method for complex models. The estimator is obtained by optimizing a criterion based on a standardized distance between the sample wavelet variance (WV) estimates and the model-based WV. Indeed, the WV provides a decomposition of the process variance across different scales, so that it contains information about different features of the stochastic model. We derive the asymptotic properties of the proposed estimator for inference and perform a simulation study to compare our estimator to the MLE and the LSE with different models. We also set easy-to-verify sufficient conditions on composite models for our estimator to be consistent. We use the new estimator to estimate the stochastic error's parameters of the sum of three first-order Gauss--Markov processes by means of a sample of size over 800,000 issued from gyroscopes that compose inertial navigation systems. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1021-1030 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.799920 File-URL: http://hdl.handle.net/10.1080/01621459.2013.799920 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:1021-1030 Template-Type: ReDIF-Article 1.0 Author-Name: Cheryl J. Flynn Author-X-Name-First: Cheryl J. Author-X-Name-Last: Flynn Author-Name: Clifford M. Hurvich Author-X-Name-First: Clifford M. Author-X-Name-Last: Hurvich Author-Name: Jeffrey S. Simonoff Author-X-Name-First: Jeffrey S. Author-X-Name-Last: Simonoff Title: Efficiency for Regularization Parameter Selection in Penalized Likelihood Estimation of Misspecified Models Abstract: It has been shown that Akaike information criterion (AIC)-type criteria are asymptotically efficient selectors of the tuning parameter in nonconcave penalized regression methods under the assumption that the population variance is known or that a consistent estimator is available. We relax this assumption to prove that AIC itself is asymptotically efficient and we study its performance in finite samples. In classical regression, it is known that AIC tends to select overly complex models when the dimension of the maximum candidate model is large relative to the sample size. Simulation studies suggest that AIC suffers from the same shortcomings when used in penalized regression. We therefore propose the use of the classical corrected AIC (AICc) as an alternative and prove that it maintains the desired asymptotic properties. To broaden our results, we further prove the efficiency of AIC for penalized likelihood methods in the context of generalized linear models with no dispersion parameter. Similar results exist in the literature but only for a restricted set of candidate models. By employing results from the classical literature on maximum-likelihood estimation in misspecified models, we are able to establish this result for a general set of candidate models.
We use simulations to assess the performance of AIC and AICc, as well as that of other selectors, in finite samples for both smoothly clipped absolute deviation (SCAD)-penalized and Lasso regressions, and a real data example is considered. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1031-1043 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.801775 File-URL: http://hdl.handle.net/10.1080/01621459.2013.801775 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:1031-1043 Template-Type: ReDIF-Article 1.0 Author-Name: Yingying Fan Author-X-Name-First: Yingying Author-X-Name-Last: Fan Author-Name: Jinchi Lv Author-X-Name-First: Jinchi Author-X-Name-Last: Lv Title: Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space Abstract: High-dimensional data analysis has motivated a spectrum of regularization methods for variable selection and sparse modeling, with two popular methods being convex and concave ones. A long debate has taken place on whether one class dominates the other, an important question both in theory and to practitioners. In this article, we characterize the asymptotic equivalence of regularization methods, with general penalty functions, in a thresholded parameter space under the generalized linear model setting, where the dimensionality can grow exponentially with the sample size. To assess their performance, we establish the oracle inequalities--as in Bickel, Ritov, and Tsybakov (2009)--of the global minimizer for these methods under various prediction and variable selection losses. These results reveal an interesting phase transition phenomenon. For polynomially growing dimensionality, the L1-regularization method of the Lasso and concave methods are asymptotically equivalent, having the same convergence rates in the oracle inequalities. For exponentially growing dimensionality, concave methods are asymptotically equivalent but have faster convergence rates than the Lasso. We also establish a stronger property of the oracle risk inequalities of the regularization methods, as well as the sampling properties of computable solutions. Our new theoretical results are illustrated and justified by simulation and real data examples. Journal: Journal of the American Statistical Association Pages: 1044-1061 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.803972 File-URL: http://hdl.handle.net/10.1080/01621459.2013.803972 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:1044-1061 Template-Type: ReDIF-Article 1.0 Author-Name: Huixia Judy Wang Author-X-Name-First: Huixia Judy Author-X-Name-Last: Wang Author-Name: Deyuan Li Author-X-Name-First: Deyuan Author-X-Name-Last: Li Title: Estimation of Extreme Conditional Quantiles Through Power Transformation Abstract: The estimation of extreme conditional quantiles is an important issue in numerous disciplines. Quantile regression (QR) provides a natural way to capture the covariate effects at different tails of the response distribution. However, without any distributional assumptions, estimation from conventional QR is often unstable at the tails, especially for heavy-tailed distributions due to data sparsity.
In this article, we develop a new three-stage estimation procedure that integrates QR and extreme value theory by estimating intermediate conditional quantiles using QR and extrapolating these estimates to the tails based on extreme value theory. Using the power-transformed QR, the proposed method allows more flexibility than existing methods that rely on the linearity of quantiles on the original scale, while extending the applicability of parametric models to borrow information across covariates without resorting to nonparametric smoothing. In addition, we propose a test procedure to assess the commonality of the extreme value index, which could be useful for obtaining more efficient estimation by sharing information across covariates. We establish the asymptotic properties of the proposed method and demonstrate its value through a simulation study and the analysis of medical cost data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1062-1074 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.820134 File-URL: http://hdl.handle.net/10.1080/01621459.2013.820134 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:1062-1074 Template-Type: ReDIF-Article 1.0 Author-Name: Antonio F. Galvao Author-X-Name-First: Antonio F. Author-X-Name-Last: Galvao Author-Name: Carlos Lamarche Author-X-Name-First: Carlos Author-X-Name-Last: Lamarche Author-Name: Luiz Renato Lima Author-X-Name-First: Luiz Renato Author-X-Name-Last: Lima Title: Estimation of Censored Quantile Regression for Panel Data With Fixed Effects Abstract: This article investigates estimation of censored quantile regression (QR) models with fixed effects. Standard available methods are not appropriate for estimation of a censored QR model with a large number of parameters or with covariates correlated with unobserved individual heterogeneity. Motivated by these limitations, the article proposes estimators that are obtained by applying fixed effects QR to subsets of observations selected either parametrically or nonparametrically. We derive the limiting distribution of the new estimators under joint limits, and conduct Monte Carlo simulations to assess their small sample performance. An empirical application of the method to study the impact of the 1964 Civil Rights Act on the black--white earnings gap is considered. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1075-1089 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.818002 File-URL: http://hdl.handle.net/10.1080/01621459.2013.818002 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:1075-1089 Template-Type: ReDIF-Article 1.0 Author-Name: Yanyuan Ma Author-X-Name-First: Yanyuan Author-X-Name-Last: Ma Author-Name: Mijeong Kim Author-X-Name-First: Mijeong Author-X-Name-Last: Kim Author-Name: Marc G. Genton Author-X-Name-First: Marc G. Author-X-Name-Last: Genton Title: Semiparametric Efficient and Robust Estimation of an Unknown Symmetric Population Under Arbitrary Sample Selection Bias Abstract: We propose semiparametric methods to estimate the center and shape of a symmetric population when a representative sample of the population is unavailable due to selection bias.
We allow an arbitrary sample selection mechanism determined by the data collection procedure, and we do not impose any parametric form on the population distribution. Under this general framework, we construct a family of consistent estimators of the center that are robust to population model misspecification, and we identify the efficient member that reaches the minimum possible estimation variance. The asymptotic properties and finite sample performance of the estimation and inference procedures are illustrated through theoretical analysis and simulations. A data example is also provided to illustrate the usefulness of the methods in practice. Journal: Journal of the American Statistical Association Pages: 1090-1104 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.816184 File-URL: http://hdl.handle.net/10.1080/01621459.2013.816184 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:1090-1104 Template-Type: ReDIF-Article 1.0 Author-Name: Davy Paindaveine Author-X-Name-First: Davy Author-X-Name-Last: Paindaveine Author-Name: Germain Van Bever Author-X-Name-First: Germain Author-X-Name-Last: Van Bever Title: From Depth to Local Depth: A Focus on Centrality Abstract: Aiming at analyzing multimodal or nonconvexly supported distributions through data depth, we introduce a local extension of depth. Our construction is obtained by conditioning the distribution to appropriate depth-based neighborhoods and has the advantages, among others, of maintaining affine invariance and applying to all depths in a generic way. Most importantly, unlike their competitors, which (for extreme localization) rather measure probability mass, the resulting local depths focus on centrality and remain of a genuine depth nature at any locality level. We derive their main properties, establish consistency of their sample versions, and study their behavior under extreme localization. We present two applications of the proposed local depth (for classification and for symmetry testing), and we extend our construction to the regression depth context. Throughout, we illustrate the results on several datasets, both artificial and real, univariate and multivariate. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1105-1119 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.813390 File-URL: http://hdl.handle.net/10.1080/01621459.2013.813390 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:1105-1119 Template-Type: ReDIF-Article 1.0 Author-Name: Fabrizia Mealli Author-X-Name-First: Fabrizia Author-X-Name-Last: Mealli Author-Name: Barbara Pacini Author-X-Name-First: Barbara Author-X-Name-Last: Pacini Title: Using Secondary Outcomes to Sharpen Inference in Randomized Experiments With Noncompliance Abstract: We develop new methods for analyzing randomized experiments with noncompliance and, by extension, instrumental variable settings, when the often controversial, but key, exclusion restriction assumption is violated. We show how existing large-sample bounds on intention-to-treat effects for the subpopulations of compliers, never-takers, and always-takers can be tightened by exploiting the joint distribution of the outcome of interest and a secondary outcome, for which the exclusion restriction is satisfied.
The derived bounds can be used to detect violations of the exclusion restriction and to assess the magnitude of these violations in instrumental variable settings. It is shown that the reduced width of the bounds depends on the strength of the association of the auxiliary variable with the primary outcome and the compliance status. We also show how the setup we consider offers new identifying assumptions for intention-to-treat effects. The role of the auxiliary information is shown in two examples: a real social job training experiment and a simulated medical randomized encouragement study. We also discuss issues of inference in finite samples and show how to conduct Bayesian analysis in our partial and point identified settings. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1120-1131 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.802238 File-URL: http://hdl.handle.net/10.1080/01621459.2013.802238 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:1120-1131 Template-Type: ReDIF-Article 1.0 Author-Name: Ryan Martin Author-X-Name-First: Ryan Author-X-Name-Last: Martin Author-Name: Chuanhai Liu Author-X-Name-First: Chuanhai Author-X-Name-Last: Liu Title: Correction Abstract: This is to provide corrections to Theorems 1 and 3 in Martin and Liu (2013). The latter correction also casts further light on the role of nested predictive random sets. Journal: Journal of the American Statistical Association Pages: 1138-1139 Issue: 503 Volume: 108 Year: 2013 Month: 9 X-DOI: 10.1080/01621459.2013.796885 File-URL: http://hdl.handle.net/10.1080/01621459.2013.796885 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:1138-1139 Template-Type: ReDIF-Article 1.0 Author-Name: Marie Davidian Author-X-Name-First: Marie Author-X-Name-Last: Davidian Title: The International Year of Statistics: A Celebration and A Call to Action Journal: Journal of the American Statistical Association Pages: 1141-1146 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.844019 File-URL: http://hdl.handle.net/10.1080/01621459.2013.844019 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1141-1146 Template-Type: ReDIF-Article 1.0 Author-Name: Blakeley B. McShane Author-X-Name-First: Blakeley B. Author-X-Name-Last: McShane Author-Name: Shane T. Jensen Author-X-Name-First: Shane T. Author-X-Name-Last: Jensen Author-Name: Allan I. Pack Author-X-Name-First: Allan I. Author-X-Name-Last: Pack Author-Name: Abraham J. Wyner Author-X-Name-First: Abraham J. Author-X-Name-Last: Wyner Title: Statistical Learning With Time Series Dependence: An Application to Scoring Sleep in Mice Abstract: We develop methodology that combines statistical learning methods with generalized Markov models, thereby enhancing the former to account for time series dependence. Our methodology can accommodate very general and very long-term time dependence structures in an easily estimable and computationally tractable fashion. We apply our methodology to the scoring of sleep behavior in mice.
As methods currently used to score sleep in mice are expensive, invasive, and labor intensive, there is considerable interest in developing high-throughput automated systems which would allow many mice to be scored cheaply and quickly. Previous efforts at automation have been able to differentiate sleep from wakefulness, but they are unable to differentiate the rare and important state of rapid eye movement (REM) sleep from non-REM sleep. Key difficulties in detecting REM are that (i) REM is much rarer than non-REM and wakefulness, (ii) REM looks similar to non-REM in terms of the observed covariates, (iii) the data are noisy, and (iv) the data contain strong time dependence structures crucial for differentiating REM from non-REM. Our new approach (i) shows improved differentiation of REM from non-REM sleep and (ii) accurately estimates aggregate quantities of sleep in our application to video-based sleep scoring of mice. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1147-1162 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.779838 File-URL: http://hdl.handle.net/10.1080/01621459.2013.779838 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1147-1162 Template-Type: ReDIF-Article 1.0 Author-Name: Kerby Shedden Author-X-Name-First: Kerby Author-X-Name-Last: Shedden Title: Comment Journal: Journal of the American Statistical Association Pages: 1162-1163 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.836970 File-URL: http://hdl.handle.net/10.1080/01621459.2013.836970 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1162-1163 Template-Type: ReDIF-Article 1.0 Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: Yuanjia Wang Author-X-Name-First: Yuanjia Author-X-Name-Last: Wang Title: Comment Journal: Journal of the American Statistical Association Pages: 1164-1164 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.836971 File-URL: http://hdl.handle.net/10.1080/01621459.2013.836971 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1164-1164 Template-Type: ReDIF-Article 1.0 Author-Name: Blakeley B. McShane Author-X-Name-First: Blakeley B. Author-X-Name-Last: McShane Author-Name: Shane T. Jensen Author-X-Name-First: Shane T. Author-X-Name-Last: Jensen Author-Name: Allan I. Pack Author-X-Name-First: Allan I. Author-X-Name-Last: Pack Author-Name: Abraham J. Wyner Author-X-Name-First: Abraham J. Author-X-Name-Last: Wyner Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 1165-1172 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.844021 File-URL: http://hdl.handle.net/10.1080/01621459.2013.844021 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1165-1172 Template-Type: ReDIF-Article 1.0 Author-Name: Tao Liu Author-X-Name-First: Tao Author-X-Name-Last: Liu Author-Name: Joseph W. Hogan Author-X-Name-First: Joseph W. 
Author-X-Name-Last: Hogan Author-Name: Lisa Wang Author-X-Name-First: Lisa Author-X-Name-Last: Wang Author-Name: Shangxuan Zhang Author-X-Name-First: Shangxuan Author-X-Name-Last: Zhang Author-Name: Rami Kantor Author-X-Name-First: Rami Author-X-Name-Last: Kantor Title: Optimal Allocation of Gold Standard Testing Under Constrained Availability: Application to Assessment of HIV Treatment Failure Abstract: The World Health Organization (WHO) guidelines for monitoring the effectiveness of human immunodeficiency virus (HIV) treatment in resource-limited settings are mostly based on clinical and immunological markers (e.g., CD4 cell counts). Recent research indicates that the guidelines are inadequate and can result in high error rates. Viral load (VL) is considered the "gold standard," yet its widespread use is limited by cost and infrastructure. In this article, we propose a diagnostic algorithm that uses information from routinely collected clinical and immunological markers to guide a selective use of VL testing for diagnosing HIV treatment failure, under the assumption that VL testing is available at only a limited proportion of patient visits. Our algorithm identifies the patient subpopulation for which the use of the limited VL testing minimizes a predefined risk (e.g., misdiagnosis error rate). Diagnostic properties of our proposed algorithm are assessed by simulations. For illustration, data from the Miriam Hospital Immunology Clinic (Providence, RI) are analyzed. Journal: Journal of the American Statistical Association Pages: 1173-1188 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.810149 File-URL: http://hdl.handle.net/10.1080/01621459.2013.810149 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1173-1188 Template-Type: ReDIF-Article 1.0 Author-Name: Takahiro Hoshino Author-X-Name-First: Takahiro Author-X-Name-Last: Hoshino Title: Semiparametric Bayesian Estimation for Marginal Parametric Potential Outcome Modeling: Application to Causal Inference Abstract: We propose a new semiparametric Bayesian model for causal inference in which assignment to treatment depends on potential outcomes. The model uses the probit stick-breaking process mixture proposed by Chung and Dunson (2009), a variant of Dirichlet process mixture modeling. In contrast to previous Bayesian models, the proposed model directly estimates the parameters of the marginal parametric model of potential outcomes, while it relaxes the strong ignorability assumption and requires no parametric model assumption for the assignment model and conditional distribution of the covariate vector. The proposed estimation method is more robust than maximum likelihood estimation, in that it does not require knowledge of the full joint distribution of potential outcomes, covariates, and assignments. In addition, the method is more efficient than fully nonparametric Bayes methods. We apply this model to infer the differential effects of cognitive and noncognitive skills on the wages of production and nonproduction workers using panel data from the National Longitudinal Survey of Youth in 1979. The study also presents the causal effect of online word-of-mouth on Web site browsing behavior. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 1189-1204 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.835656 File-URL: http://hdl.handle.net/10.1080/01621459.2013.835656 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1189-1204 Template-Type: ReDIF-Article 1.0 Author-Name: Malka Gorfine Author-X-Name-First: Malka Author-X-Name-Last: Gorfine Author-Name: Li Hsu Author-X-Name-First: Li Author-X-Name-Last: Hsu Author-Name: Giovanni Parmigiani Author-X-Name-First: Giovanni Author-X-Name-Last: Parmigiani Title: Frailty Models for Familial Risk With Application to Breast Cancer Abstract: In evaluating familial risk for disease we have two main statistical tasks: assessing the probability of carrying an inherited genetic mutation conferring higher risk, and predicting the absolute risk of developing diseases over time for those individuals whose mutation status is known. Despite substantial progress, much remains unknown about the role of genetic and environmental risk factors, about the sources of variation in risk among families that carry high-risk mutations, and about the sources of familial aggregation beyond major Mendelian effects. These sources of heterogeneity contribute substantial variation in risk across families. In this article we present simple and efficient methods for accounting for this variation in familial risk assessment. Our methods are based on frailty models. We implemented them in the context of generalizing Mendelian models of cancer risk, and compared our approaches to others that do not consider heterogeneity across families. Our extensive simulation study demonstrates that when predicting the risk of developing a disease over time conditional on carrier status, accounting for heterogeneity results in a substantial improvement in the area under the curve of the receiver operating characteristic. On the other hand, the improvement for carriership probability estimation is more limited. We illustrate the utility of the proposed approach through the analysis of BRCA1 and BRCA2 mutation carriers in the Washington Ashkenazi Kin-Cohort Study of Breast Cancer. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1205-1215 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.818001 File-URL: http://hdl.handle.net/10.1080/01621459.2013.818001 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1205-1215 Template-Type: ReDIF-Article 1.0 Author-Name: Huaihou Chen Author-X-Name-First: Huaihou Author-X-Name-Last: Chen Author-Name: Yuanjia Wang Author-X-Name-First: Yuanjia Author-X-Name-Last: Wang Author-Name: Myunghee Cho Paik Author-X-Name-First: Myunghee Cho Author-X-Name-Last: Paik Author-Name: H. Alex Choi Author-X-Name-First: H. Alex Author-X-Name-Last: Choi Title: A Marginal Approach to Reduced-Rank Penalized Spline Smoothing With Application to Multilevel Functional Data Abstract: Multilevel functional data are collected in many biomedical studies. For example, in a study of the effect of Nimodipine on patients with subarachnoid hemorrhage (SAH), patients underwent multiple 4-hr treatment cycles. Within each treatment cycle, subjects' vital signs were reported every 10 min.
These data have a natural multilevel structure with treatment cycles nested within subjects and measurements nested within cycles. Most literature on nonparametric analysis of such multilevel functional data focuses on conditional approaches using functional mixed effects models. However, parameters obtained from the conditional models do not have direct interpretations as population average effects. When population effects are of interest, we may employ marginal regression models. In this work, we propose marginal approaches to fit multilevel functional data through penalized spline generalized estimating equation (penalized spline GEE). The procedure is effective for modeling multilevel correlated generalized outcomes as well as continuous outcomes without suffering from numerical difficulties. We provide a variance estimator robust to misspecification of correlation structure. We investigate the large sample properties of the penalized spline GEE estimator with multilevel continuous data and show that the asymptotics falls into two categories. In the small knots scenario, the estimated mean function is asymptotically efficient when the true correlation function is used and the asymptotic bias does not depend on the working correlation matrix. In the large knots scenario, both the asymptotic bias and variance depend on the working correlation. We propose a new method to select the smoothing parameter for penalized spline GEE based on an estimate of the asymptotic mean squared error (MSE). We conduct extensive simulation studies to examine the properties of the proposed estimator under different correlation structures and the sensitivity of the variance estimation to the choice of smoothing parameter. Finally, we apply the methods to the SAH study to address a recent debate in the clinical community on discontinuing the use of Nimodipine. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1216-1229 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.826134 File-URL: http://hdl.handle.net/10.1080/01621459.2013.826134 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1216-1229 Template-Type: ReDIF-Article 1.0 Author-Name: Shane T. Jensen Author-X-Name-First: Shane T. Author-X-Name-Last: Jensen Author-Name: Jared Park Author-X-Name-First: Jared Author-X-Name-Last: Park Author-Name: Alexander F. Braunstein Author-X-Name-First: Alexander F. Author-X-Name-Last: Braunstein Author-Name: Jon Mcauliffe Author-X-Name-First: Jon Author-X-Name-Last: Mcauliffe Title: Bayesian Hierarchical Modeling of the HIV Evolutionary Response to Therapy Abstract: A major challenge for the treatment of human immunodeficiency virus (HIV) infection is the development of therapy-resistant strains. We present a statistical model that quantifies the evolution of HIV populations when exposed to particular therapies. A hierarchical Bayesian approach is used to estimate differences in rates of nucleotide changes between treatment- and control-group sequences. Each group's rates are allowed to vary spatially along the HIV genome. We employ a coalescent structure to address the sequence diversity within the treatment and control HIV populations. We evaluate the model in simulations and estimate HIV evolution in two different applications: a conventional drug therapy and an antisense gene therapy.
In both studies, we detect evidence of evolutionary escape response in the HIV population. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1230-1242 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.830449 File-URL: http://hdl.handle.net/10.1080/01621459.2013.830449 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1230-1242 Template-Type: ReDIF-Article 1.0 Author-Name: Matias D. Cattaneo Author-X-Name-First: Matias D. Author-X-Name-Last: Cattaneo Author-Name: Richard K. Crump Author-X-Name-First: Richard K. Author-X-Name-Last: Crump Author-Name: Michael Jansson Author-X-Name-First: Michael Author-X-Name-Last: Jansson Title: Generalized Jackknife Estimators of Weighted Average Derivatives Abstract: With the aim of improving the quality of asymptotic distributional approximations for nonlinear functionals of nonparametric estimators, this article revisits the large-sample properties of an important member of that class, namely a kernel-based weighted average derivative estimator. Asymptotic linearity of the estimator is established under weak conditions. Indeed, we show that the bandwidth conditions employed are necessary in some cases. A bias-corrected version of the estimator is proposed and shown to be asymptotically linear under yet weaker bandwidth conditions. Implementational details of the estimators are discussed, including bandwidth selection procedures. Consistency of an analog estimator of the asymptotic variance is also established. Numerical results from a simulation study and an empirical illustration are reported. To establish the results, a novel result on uniform convergence rates for kernel estimators is obtained. The online supplemental material to this article includes details on the theoretical proofs and other analytic derivations, and further results from the simulation study. Journal: Journal of the American Statistical Association Pages: 1243-1256 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2012.745810 File-URL: http://hdl.handle.net/10.1080/01621459.2012.745810 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1243-1256 Template-Type: ReDIF-Article 1.0 Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Title: Comment Journal: Journal of the American Statistical Association Pages: 1257-1258 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.854172 File-URL: http://hdl.handle.net/10.1080/01621459.2013.854172 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1257-1258 Template-Type: ReDIF-Article 1.0 Author-Name: Holger Dette Author-X-Name-First: Holger Author-X-Name-Last: Dette Title: Comment Journal: Journal of the American Statistical Association Pages: 1258-1260 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.859516 File-URL: http://hdl.handle.net/10.1080/01621459.2013.859516 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1258-1260 Template-Type: ReDIF-Article 1.0 Author-Name: Enno Mammen Author-X-Name-First: Enno Author-X-Name-Last: Mammen Title: Comment Journal: Journal of the American Statistical Association Pages: 1260-1262 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.829000 File-URL: http://hdl.handle.net/10.1080/01621459.2013.829000 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1260-1262 Template-Type: ReDIF-Article 1.0 Author-Name: Xiaohong Chen Author-X-Name-First: Xiaohong Author-X-Name-Last: Chen Title: Comment Journal: Journal of the American Statistical Association Pages: 1262-1264 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.855352 File-URL: http://hdl.handle.net/10.1080/01621459.2013.855352 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1262-1264 Template-Type: ReDIF-Article 1.0 Author-Name: Matias D. Cattaneo Author-X-Name-First: Matias D. Author-X-Name-Last: Cattaneo Author-Name: Richard K. Crump Author-X-Name-First: Richard K. Author-X-Name-Last: Crump Author-Name: Michael Jansson Author-X-Name-First: Michael Author-X-Name-Last: Jansson Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 1265-1268 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.856717 File-URL: http://hdl.handle.net/10.1080/01621459.2013.856717 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1265-1268 Template-Type: ReDIF-Article 1.0 Author-Name: Aurore Delaigle Author-X-Name-First: Aurore Author-X-Name-Last: Delaigle Author-Name: Peter Hall Author-X-Name-First: Peter Author-X-Name-Last: Hall Title: Classification Using Censored Functional Data Abstract: We consider classification of functional data when the training curves are not observed on the same interval. Different types of classifier are suggested, one of which involves a new curve extension procedure. Our approach enables us to exploit the information contained in the endpoints of these intervals by incorporating it in an explicit but flexible way. We study asymptotic properties of our classifiers, and show that, in a variety of settings, they can even produce asymptotically perfect classification. The performance of our techniques is illustrated in applications to real and simulated data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1269-1283 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.824893 File-URL: http://hdl.handle.net/10.1080/01621459.2013.824893 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1269-1283 Template-Type: ReDIF-Article 1.0 Author-Name: Yehua Li Author-X-Name-First: Yehua Author-X-Name-Last: Li Author-Name: Naisyin Wang Author-X-Name-First: Naisyin Author-X-Name-Last: Wang Author-Name: Raymond J. Carroll Author-X-Name-First: Raymond J. Author-X-Name-Last: Carroll Title: Selecting the Number of Principal Components in Functional Data Abstract: Functional principal component analysis (FPCA) has become the most widely used dimension reduction tool for functional data analysis. 
We consider functional data measured at random, subject-specific time points, contaminated with measurement error, allowing for both sparse and dense functional data, and propose novel information criteria to select the number of principal components in such data. We propose a Bayesian information criterion based on marginal modeling that can consistently select the number of principal components for both sparse and dense functional data. For dense functional data, we also develop an Akaike information criterion based on the expected Kullback--Leibler information under a Gaussian assumption. In connection with the time series literature, we also consider a class of information criteria proposed for factor analysis of multivariate time series and show that they are still consistent for dense functional data if a prescribed undersmoothing scheme is undertaken in the FPCA algorithm. We perform intensive simulation studies and show that the proposed information criteria vastly outperform existing methods for this type of data. Surprisingly, our empirical evidence shows that our information criteria proposed for dense functional data also perform well for sparse functional data. An empirical example using colon carcinogenesis data is also provided to illustrate the results. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1284-1294 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.788980 File-URL: http://hdl.handle.net/10.1080/01621459.2013.788980 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1284-1294 Template-Type: ReDIF-Article 1.0 Author-Name: Jingfei Zhang Author-X-Name-First: Jingfei Author-X-Name-Last: Zhang Author-Name: Yuguo Chen Author-X-Name-First: Yuguo Author-X-Name-Last: Chen Title: Sampling for Conditional Inference on Network Data Abstract: Random graphs with given vertex degrees have been widely used as a model for many real-world complex networks. However, both statistical inference and analytic study of such networks present great challenges. In this article, we propose a new sequential importance sampling method for sampling networks with a given degree sequence. These samples can be used to approximate closely the null distributions of a number of test statistics involved in such networks and provide an accurate estimate of the total number of networks with given vertex degrees. We study the asymptotic behavior of the proposed algorithm and prove that the importance weight remains bounded as the size of the graph grows. This property guarantees that the proposed sampling algorithm can still work efficiently even for large sparse graphs. We apply our method to a range of examples to demonstrate its efficiency in real problems. Journal: Journal of the American Statistical Association Pages: 1295-1307 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2012.758587 File-URL: http://hdl.handle.net/10.1080/01621459.2012.758587 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1295-1307 Template-Type: ReDIF-Article 1.0 Author-Name: Lisha Chen Author-X-Name-First: Lisha Author-X-Name-Last: Chen Author-Name: Winston Wei Dou Author-X-Name-First: Winston Wei Author-X-Name-Last: Dou Author-Name: Zhihua Qiao Author-X-Name-First: Zhihua Author-X-Name-Last: Qiao Title: Ensemble Subsampling for Imbalanced Multivariate Two-Sample Tests Abstract: Some existing nonparametric two-sample tests for equality of multivariate distributions perform unsatisfactorily when the two sample sizes are unbalanced. In particular, the power of these tests tends to diminish with increasingly unbalanced sample sizes. In this article, we propose a new testing procedure to solve this problem. The proposed test, based on the nearest neighbor method by Schilling, employs a novel ensemble subsampling scheme to remedy this issue. More specifically, the test statistic is a weighted average of a collection of statistics, each associated with a randomly selected subsample of the data. We derive the asymptotic distribution of the test statistic under the null hypothesis and show that the new test is consistent against all alternatives when the ratio of the sample sizes either goes to a finite limit or tends to infinity. Via simulated data examples we demonstrate that the new test has increasing power with increasing sample size ratio when the size of the smaller sample is fixed. The test is applied to a real-data example in the field of corporate finance. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1308-1323 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.800763 File-URL: http://hdl.handle.net/10.1080/01621459.2013.800763 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1308-1323 Template-Type: ReDIF-Article 1.0 Author-Name: Tsuyoshi Kunihama Author-X-Name-First: Tsuyoshi Author-X-Name-Last: Kunihama Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: Bayesian Modeling of Temporal Dependence in Large Sparse Contingency Tables Abstract: It is of interest in many applications to study trends over time in relationships among categorical variables, such as age group, ethnicity, religious affiliation, political party, and preference for particular policies. At each time point, a sample of individuals provides responses to a set of questions, with different individuals sampled at each time. In such settings, there tends to be an abundance of missing data and the variables being measured may change over time. At each time point, we obtain a large sparse contingency table, with the number of cells often much larger than the number of individuals being surveyed. To borrow information across time in modeling large sparse contingency tables, we propose a Bayesian autoregressive tensor factorization approach. The proposed model relies on a probabilistic Parafac factorization of the joint pmf characterizing the categorical data distribution at each time point, with autocorrelation included across times. We develop efficient computational methods that rely on Markov chain Monte Carlo. The methods are evaluated through simulation examples and applied to social survey data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 1324-1338 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.823866 File-URL: http://hdl.handle.net/10.1080/01621459.2013.823866 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1324-1338 Template-Type: ReDIF-Article 1.0 Author-Name: Nicholas G. Polson Author-X-Name-First: Nicholas G. Author-X-Name-Last: Polson Author-Name: James G. Scott Author-X-Name-First: James G. Author-X-Name-Last: Scott Author-Name: Jesse Windle Author-X-Name-First: Jesse Author-X-Name-Last: Windle Title: Bayesian Inference for Logistic Models Using Pólya--Gamma Latent Variables Abstract: We propose a new data-augmentation strategy for fully Bayesian inference in models with binomial likelihoods. The approach appeals to a new class of Pólya--Gamma distributions, which are constructed in detail. A variety of examples are presented to show the versatility of the method, including logistic regression, negative binomial regression, nonlinear mixed-effect models, and spatial models for count data. In each case, our data-augmentation strategy leads to simple, effective methods for posterior inference that (1) circumvent the need for analytic approximations, numerical integration, or Metropolis--Hastings; and (2) outperform other known data-augmentation strategies, both in ease of use and in computational efficiency. All methods, including an efficient sampler for the Pólya--Gamma distribution, are implemented in the R package BayesLogit. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1339-1349 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.829001 File-URL: http://hdl.handle.net/10.1080/01621459.2013.829001 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1339-1349 Template-Type: ReDIF-Article 1.0 Author-Name: Wentao Li Author-X-Name-First: Wentao Author-X-Name-Last: Li Author-Name: Zhiqiang Tan Author-X-Name-First: Zhiqiang Author-X-Name-Last: Tan Author-Name: Rong Chen Author-X-Name-First: Rong Author-X-Name-Last: Chen Title: Two-Stage Importance Sampling With Mixture Proposals Abstract: For importance sampling (IS), multiple proposals can be combined to address different aspects of a target distribution. There are various methods for IS with multiple proposals, including Hesterberg's stratified IS estimator, Owen and Zhou's regression estimator, and Tan's maximum likelihood estimator. For the problem of efficiently allocating samples to different proposals, it is natural to use a pilot sample to select the mixture proportions before the actual sampling and estimation. However, most existing discussions of such a two-stage procedure are empirical. In this article, we establish a theoretical framework for applying the two-stage procedure to various methods, including the asymptotic properties and the choice of the pilot sample size. Our simulation studies show that these two-stage estimators can outperform estimators with naive choices of mixture proportions. Furthermore, while Owen and Zhou's and Tan's estimators are designed for estimating normalizing constants, we extend their usage and the two-stage procedure to estimating expectations and show that the improvement is still preserved in this extension.
Journal: Journal of the American Statistical Association Pages: 1350-1365 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.831980 File-URL: http://hdl.handle.net/10.1080/01621459.2013.831980 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1350-1365 Template-Type: ReDIF-Article 1.0 Author-Name: Jian Zhang Author-X-Name-First: Jian Author-X-Name-Last: Zhang Title: Epistatic Clustering: A Model-Based Approach for Identifying Links Between Clusters Abstract: Most clustering methods assume that the data can be represented by mutually exclusive clusters, although this assumption may not be the case in practice. For example, in gene expression microarray studies, investigators have often found that a gene can play multiple functions in a cell and may, therefore, belong to more than one cluster simultaneously, and that gene clusters can be linked to each other in certain pathways. This article examines the effect of the above assumption on the likelihood of finding latent clusters through theoretical calculations, simulation studies in which the epistatic structures were known in advance, and real data analyses. To explore potential links between clusters, we introduce an epistatic mixture model which extends the Gaussian mixture by including epistatic terms. A generalized expectation-maximization (EM) algorithm is developed to compute the related maximum likelihood estimators. The Bayesian information criterion is then used to determine the order of the proposed model. A bootstrap test is proposed for testing whether the epistatic mixture model is a significantly better fit to the data than a standard mixture model in which each data point belongs to one cluster. The asymptotic properties of the proposed estimators are also investigated when the number of analysis units is large. The results demonstrate that the epistatic links between clusters do have a serious effect on the accuracy of clustering and that our epistatic approach can substantially reduce such an effect and improve the fit. Journal: Journal of the American Statistical Association Pages: 1366-1384 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.835661 File-URL: http://hdl.handle.net/10.1080/01621459.2013.835661 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1366-1384 Template-Type: ReDIF-Article 1.0 Author-Name: Sanat K. Sarkar Author-X-Name-First: Sanat K. Author-X-Name-Last: Sarkar Author-Name: Jingjing Chen Author-X-Name-First: Jingjing Author-X-Name-Last: Chen Author-Name: Wenge Guo Author-X-Name-First: Wenge Author-X-Name-Last: Guo Title: Multiple Testing in a Two-Stage Adaptive Design With Combination Tests Controlling FDR Abstract: Testing multiple null hypotheses in two stages to decide which of these can be rejected or accepted at the first stage and which should be followed up for further testing with additional observations is of importance in many scientific studies. We develop two procedures, each with two different combination functions, Fisher's and Simes', to combine p-values from two stages, given prespecified boundaries on the first-stage p-values in terms of the false discovery rate (FDR) and controlling the overall FDR at a desired level.
The FDR control is proved when the pairs of first- and second-stage p-values are independent and those corresponding to the null hypotheses are identically distributed as a pair (p_1, p_2) satisfying the p-clud property. We conducted simulations to show that (1) our two-stage procedures can have significant power improvements over the first-stage Benjamini--Hochberg (BH) procedure compared to the improvement offered by the ideal BH procedure that one would have used had the second-stage data been available for all the hypotheses, and can continue to control the FDR under some dependence situations, and (2) they can offer considerable cost savings compared to the ideal BH procedure. The procedures are illustrated through a real gene expression dataset. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1385-1401 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.835662 File-URL: http://hdl.handle.net/10.1080/01621459.2013.835662 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1385-1401 Template-Type: ReDIF-Article 1.0 Author-Name: Luo Lu Author-X-Name-First: Luo Author-X-Name-Last: Lu Author-Name: Hui Jiang Author-X-Name-First: Hui Author-X-Name-Last: Jiang Author-Name: Wing H. Wong Author-X-Name-First: Wing H. Author-X-Name-Last: Wong Title: Multivariate Density Estimation by Bayesian Sequential Partitioning Abstract: Consider a class of densities that are piecewise constant functions over partitions of the sample space defined by sequential coordinate partitioning. We introduce a prior distribution for a density in this function class and derive in closed form the marginal posterior distribution of the corresponding partition. A computationally efficient method, based on sequential importance sampling, is presented for the inference of the partition from this posterior distribution. Compared to traditional approaches such as the kernel method or the histogram, the Bayesian sequential partitioning (BSP) method proposed here is capable of providing much more accurate estimates when the sample space is of moderate to high dimension. We illustrate this by simulated as well as real data examples. The examples also demonstrate how BSP can be used to design new classification methods competitive with the state of the art. Journal: Journal of the American Statistical Association Pages: 1402-1410 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.813389 File-URL: http://hdl.handle.net/10.1080/01621459.2013.813389 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1402-1410 Template-Type: ReDIF-Article 1.0 Author-Name: Min Yang Author-X-Name-First: Min Author-X-Name-Last: Yang Author-Name: Stefanie Biedermann Author-X-Name-First: Stefanie Author-X-Name-Last: Biedermann Author-Name: Elina Tang Author-X-Name-First: Elina Author-X-Name-Last: Tang Title: On Optimal Designs for Nonlinear Models: A General and Efficient Algorithm Abstract: Finding optimal designs for nonlinear models is challenging in general. Although some recent results allow us to focus on a simple subclass of designs for most problems, deriving a specific optimal design still mainly depends on numerical approaches. There is a need for a general and efficient algorithm that is more broadly applicable than the current state-of-the-art methods.
We present a new algorithm that can be used to find optimal designs with respect to a broad class of optimality criteria, when the model parameters or functions thereof are of interest, and for both locally optimal and multistage design strategies. We prove convergence to the optimal design, and show in various examples that the new algorithm outperforms the current state-of-the-art algorithms. Journal: Journal of the American Statistical Association Pages: 1411-1420 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.806268 File-URL: http://hdl.handle.net/10.1080/01621459.2013.806268 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1411-1420 Template-Type: ReDIF-Article 1.0 Author-Name: Ming-yen Cheng Author-X-Name-First: Ming-yen Author-X-Name-Last: Cheng Author-Name: Hau-tieng Wu Author-X-Name-First: Hau-tieng Author-X-Name-Last: Wu Title: Local Linear Regression on Manifolds and Its Geometric Interpretation Abstract: High-dimensional data analysis has been an active research area, and the main focuses have been variable selection and dimension reduction. In practice, it often occurs that the variables are located on an unknown, lower-dimensional nonlinear manifold. Under this manifold assumption, one purpose of this article is regression and gradient estimation on the manifold, and another is developing a new tool for manifold learning. As regards the first aim, we suggest directly reducing the dimensionality to the intrinsic dimension d of the manifold, and performing the popular local linear regression (LLR) on a tangent plane estimate. An immediate consequence is a dramatic reduction in the computational time when the ambient space dimension p >> d. We provide rigorous theoretical justification of the convergence of the proposed regression and gradient estimators by carefully analyzing the curvature, boundary, and nonuniform sampling effects. We propose a bandwidth selector that can handle heteroscedastic errors. With reference to the second aim, we analyze carefully the asymptotic behavior of our regression estimator both in the interior and near the boundary of the manifold, and make explicit its relationship with manifold learning, in particular estimating the Laplace--Beltrami operator of the manifold. In this context, we also make clear that it is important to use a smaller bandwidth in the tangent plane estimation than in the LLR. A simulation study and applications to the Isomap face data and a clinical computed tomography scan dataset are used to illustrate the computational speed and estimation accuracy of our methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1421-1434 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.827984 File-URL: http://hdl.handle.net/10.1080/01621459.2013.827984 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1421-1434 Template-Type: ReDIF-Article 1.0 Author-Name: Shelby J. Haberman Author-X-Name-First: Shelby J.
Author-X-Name-Last: Haberman Author-Name: Sandip Sinharay Author-X-Name-First: Sandip Author-X-Name-Last: Sinharay Title: Generalized Residuals for General Models for Contingency Tables With Application to Item Response Theory Abstract: Generalized residuals are a tool employed in the analysis of contingency tables to examine possible sources of model error. They have typically been applied to log-linear models and to latent-class models. A general approach to generalized residuals is developed for a very general class of models for contingency tables. To illustrate their use, generalized residuals are applied to models based on item response theory (IRT) models. Such models are commonly applied to analysis of standardized achievement or aptitude tests. To obtain a realistic perspective on application of generalized residuals, actual testing data are employed. Journal: Journal of the American Statistical Association Pages: 1435-1444 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.835660 File-URL: http://hdl.handle.net/10.1080/01621459.2013.835660 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1435-1444 Template-Type: ReDIF-Article 1.0 Author-Name: Bin Zhu Author-X-Name-First: Bin Author-X-Name-Last: Zhu Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: Locally Adaptive Bayes Nonparametric Regression via Nested Gaussian Processes Abstract: We propose a nested Gaussian process (nGP) as a locally adaptive prior for Bayesian nonparametric regression. Specified through a set of stochastic differential equations (SDEs), the nGP imposes a Gaussian process prior for the function's mth-order derivative. The nesting comes in through including a local instantaneous mean function, which is drawn from another Gaussian process inducing adaptivity to locally varying smoothness. We discuss the support of the nGP prior in terms of the closure of a reproducing kernel Hilbert space, and consider theoretical properties of the posterior. The posterior mean under the nGP prior is shown to be equivalent to the minimizer of a nested penalized sum-of-squares involving penalties for both the global and local roughness of the function. Using highly efficient Markov chain Monte Carlo for posterior inference, the proposed method performs well in simulation studies compared to several alternatives, and is scalable to massive data, illustrated through a proteomics application. Journal: Journal of the American Statistical Association Pages: 1445-1456 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.838568 File-URL: http://hdl.handle.net/10.1080/01621459.2013.838568 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1445-1456 Template-Type: ReDIF-Article 1.0 Author-Name: Dylan S. Small Author-X-Name-First: Dylan S. Author-X-Name-Last: Small Author-Name: Jing Cheng Author-X-Name-First: Jing Author-X-Name-Last: Cheng Author-Name: M. Elizabeth Halloran Author-X-Name-First: M. Elizabeth Author-X-Name-Last: Halloran Author-Name: Paul R. Rosenbaum Author-X-Name-First: Paul R. Author-X-Name-Last: Rosenbaum Title: Case Definition and Design Sensitivity Abstract: In a case-referent study, cases of disease are compared to noncases with respect to their antecedent exposure to a treatment in an effort to determine whether exposure causes some cases of the disease. 
Because exposure is not randomly assigned in the population, as it would be if the population were a vast randomized trial, exposed and unexposed subjects may differ prior to exposure with respect to covariates that may or may not have been measured. After controlling for measured preexposure differences, for instance by matching, a sensitivity analysis asks about the magnitude of bias from unmeasured covariates that would need to be present to alter the conclusions of a study that presumed matching for observed covariates removes all bias. The definition of a case of disease affects sensitivity to unmeasured bias. We explore this issue using: (i) an asymptotic tool, the design sensitivity, (ii) a simulation for finite samples, and (iii) an example. Under favorable circumstances, a narrower case definition can yield an increase in the design sensitivity, and hence an increase in the power of a sensitivity analysis. Also, we discuss an adaptive method that seeks to discover the best case definition from the data at hand while controlling for multiple testing. An implementation in R is available as SensitivityCaseControl. Journal: Journal of the American Statistical Association Pages: 1457-1468 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.820660 File-URL: http://hdl.handle.net/10.1080/01621459.2013.820660 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1457-1468 Template-Type: ReDIF-Article 1.0 Author-Name: Ying Hung Author-X-Name-First: Ying Author-X-Name-Last: Hung Author-Name: Yijie Wang Author-X-Name-First: Yijie Author-X-Name-Last: Wang Author-Name: Veronika Zarnitsyna Author-X-Name-First: Veronika Author-X-Name-Last: Zarnitsyna Author-Name: Cheng Zhu Author-X-Name-First: Cheng Author-X-Name-Last: Zhu Author-Name: C. F. Jeff Wu Author-X-Name-First: C. F. Jeff Author-X-Name-Last: Wu Title: Hidden Markov Models With Applications in Cell Adhesion Experiments Abstract: Estimation of the number of hidden states is challenging in hidden Markov models. Motivated by the analysis of a specific type of cell adhesion experiments, a new framework based on a hidden Markov model and double penalized order selection is proposed. The order selection procedure is shown to be consistent in estimating the number of states. A modified expectation--maximization algorithm is introduced to efficiently estimate parameters in the model. Simulations show that the proposed framework outperforms existing methods. Applications of the proposed methodology to real data demonstrate the accuracy of estimating receptor--ligand bond lifetimes and waiting times which are essential in kinetic parameter estimation. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1469-1479 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.836973 File-URL: http://hdl.handle.net/10.1080/01621459.2013.836973 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1469-1479 Template-Type: ReDIF-Article 1.0 Author-Name: Marina Bogomolov Author-X-Name-First: Marina Author-X-Name-Last: Bogomolov Author-Name: Ruth Heller Author-X-Name-First: Ruth Author-X-Name-Last: Heller Title: Discovering Findings That Replicate From a Primary Study of High Dimension to a Follow-Up Study Abstract: We consider the problem of identifying whether findings replicate from one study of high dimension to another, when the primary study guides the selection of hypotheses to be examined in the follow-up study as well as when there is no division of roles into the primary and the follow-up study. We show that existing meta-analysis methods are not appropriate for this problem, and suggest novel methods instead. We prove that our multiple testing procedures control appropriate error rates. The suggested family-wise error rate controlling procedure is valid for arbitrary dependence among the test statistics within each study. A more powerful procedure is suggested for false discovery rate (FDR) control. We prove that this procedure controls the FDR if the test statistics are independent within the primary study, and independent or have positive dependence in the follow-up study. For arbitrary dependence within the primary study, and either arbitrary dependence or positive dependence in the follow-up study, simple conservative modifications of the procedure control the FDR. We demonstrate the usefulness of these procedures via simulations and real data examples. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1480-1492 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.829002 File-URL: http://hdl.handle.net/10.1080/01621459.2013.829002 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1480-1492 Template-Type: ReDIF-Article 1.0 Author-Name: Li Ma Author-X-Name-First: Li Author-X-Name-Last: Ma Title: Adaptive Testing of Conditional Association Through Recursive Mixture Modeling Abstract: In many case-control studies, a central goal is to test for association or dependence between the predictors and the response. Relevant covariates must be conditioned on to avoid false positives and loss in power. Conditioning on covariates is easy in parametric frameworks such as logistic regression, by incorporating the covariates into the model as additional variables. In contrast, nonparametric methods such as the Cochran--Mantel--Haenszel test accomplish conditioning by dividing the data into strata, one for each possible covariate value. In modern applications, this often gives rise to numerous strata, most of which are sparse due to the multidimensionality of the covariate and/or predictor space, while in reality, the covariate space often consists of just a small number of subsets with differential response-predictor dependence. We introduce a Bayesian approach to inferring from the data such an effective stratification and testing for association accordingly. The core of our framework is a recursive mixture model on the retrospective distribution of the predictors, whose mixing distribution is a prior on the partitions of the covariate space. Inference under the model can proceed efficiently in closed form through a sequence of recursions, striking a balance between model flexibility and computational tractability.
Simulation studies show that our method substantially outperforms classical tests under various scenarios. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1493-1505 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.838899 File-URL: http://hdl.handle.net/10.1080/01621459.2013.838899 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1493-1505 Template-Type: ReDIF-Article 1.0 Author-Name: Young Min Kim Author-X-Name-First: Young Min Author-X-Name-Last: Kim Author-Name: Soumendra N. Lahiri Author-X-Name-First: Soumendra N. Author-X-Name-Last: Lahiri Author-Name: Daniel J. Nordman Author-X-Name-First: Daniel J. Author-X-Name-Last: Nordman Title: A Progressive Block Empirical Likelihood Method for Time Series Abstract: This article develops a new blockwise empirical likelihood (BEL) method for stationary, weakly dependent time processes, called the progressive block empirical likelihood (PBEL). In contrast to the standard version of BEL, which uses data blocks of constant length for a given sample size and whose performance can depend crucially on the block length selection, this new approach involves a data-blocking scheme where blocks increase in length by an arithmetic progression. Consequently, no block length selections are required for the PBEL method, which implies a certain type of robustness for this version of BEL. For inference of smooth functions of the process mean, theoretical results establish the chi-squared limit of the log-likelihood ratio based on PBEL, which can be used to calibrate confidence regions. Using the same progressive block scheme, distributional extensions are also provided for other nonparametric likelihoods with time series in the family of Cressie--Read discrepancies. Simulation evidence indicates that the PBEL method can perform comparably to the standard BEL in coverage accuracy (when the latter uses a "good" block choice) and can exhibit more stability, without the need to select a usual block length. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1506-1516 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.847374 File-URL: http://hdl.handle.net/10.1080/01621459.2013.847374 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1506-1516 Template-Type: ReDIF-Article 1.0 Author-Name: Yuanshan Wu Author-X-Name-First: Yuanshan Author-X-Name-Last: Wu Author-Name: Guosheng Yin Author-X-Name-First: Guosheng Author-X-Name-Last: Yin Title: Cure Rate Quantile Regression for Censored Data With a Survival Fraction Abstract: Censored quantile regression offers a valuable complement to the traditional Cox proportional hazards model for survival analysis. Survival times tend to be right-skewed, particularly when there exists a substantial fraction of long-term survivors who are either cured or immune to the event of interest. For survival data with a cure possibility, we propose cure rate quantile regression under the common censoring scheme that survival times and censoring times are conditionally independent given the covariates. 
In a mixture formulation, we apply censored quantile regression to model the survival times of susceptible subjects and logistic regression to model the indicators of whether patients are susceptible. We develop two estimation methods using martingale-based equations: One approach fully uses all regression quantiles by iterating estimation between the cure rate and quantile regression parameters; and the other separates the two via a nonparametric kernel smoothing estimator. We establish the uniform consistency and weak convergence properties for the estimators obtained from both methods. The proposed model is evaluated through extensive simulation studies and illustrated with a bone marrow transplantation data example. Technical proofs of key theorems are given in Appendices A, B, and C, while those of lemmas and additional simulation studies on model misspecification and comparisons with other models are provided in the online Supplementary Materials A and B. Journal: Journal of the American Statistical Association Pages: 1517-1531 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.837368 File-URL: http://hdl.handle.net/10.1080/01621459.2013.837368 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1517-1531 Template-Type: ReDIF-Article 1.0 Author-Name: Tianxi Cai Author-X-Name-First: Tianxi Author-X-Name-Last: Cai Author-Name: Yingye Zheng Author-X-Name-First: Yingye Author-X-Name-Last: Zheng Title: Resampling Procedures for Making Inference Under Nested Case--Control Studies Abstract: The nested case--control (NCC) design has been widely adopted as a cost-effective solution in many large cohort studies for risk assessment with expensive markers, such as the emerging biologic and genetic markers. To analyze data from NCC studies, conditional logistic regression and maximum likelihood-based methods have been proposed. However, most of these methods either cannot be easily extended beyond the Cox model or require additional modeling assumptions. More generally applicable approaches based on inverse probability weighting (IPW) have been proposed as useful alternatives. However, due to the complex correlation structure induced by repeated finite risk set sampling, interval estimation for such IPW estimators remains challenging, especially when the estimation involves nonsmooth objective functions or when making simultaneous inferences about functions. Standard resampling procedures such as the bootstrap cannot accommodate the correlation and thus are not directly applicable. In this article, we propose a resampling procedure that can provide valid estimates for the distribution of a broad class of IPW estimators. Simulation results suggest that the proposed procedures perform well in settings where the analytical variance estimator is infeasible to derive or gives suboptimal performance. The new procedures are illustrated with data from the Framingham Offspring Study to characterize individual level cardiovascular risks over time based on the Framingham risk score, C-reactive protein, and a genetic risk score. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1532-1544 Issue: 504 Volume: 108 Year: 2013 Month: 12 X-DOI: 10.1080/01621459.2013.856715 File-URL: http://hdl.handle.net/10.1080/01621459.2013.856715 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1532-1544 Template-Type: ReDIF-Article 1.0 Author-Name: Luo Xiao Author-X-Name-First: Luo Author-X-Name-Last: Xiao Author-Name: Sally W. Thurston Author-X-Name-First: Sally W. Author-X-Name-Last: Thurston Author-Name: David Ruppert Author-X-Name-First: David Author-X-Name-Last: Ruppert Author-Name: Tanzy M. T. Love Author-X-Name-First: Tanzy M. T. Author-X-Name-Last: Love Author-Name: Philip W. Davidson Author-X-Name-First: Philip W. Author-X-Name-Last: Davidson Title: Bayesian Models for Multiple Outcomes in Domains With Application to the Seychelles Child Development Study Abstract: The Seychelles Child Development Study (SCDS) examines the effects of prenatal exposure to methylmercury on the functioning of the central nervous system. The SCDS data include 20 outcomes measured on 9-year-old children that can be classified broadly in four outcome classes or "domains": cognition, memory, motor, and social behavior. Previous analyses and scientific theory suggest that these outcomes may belong to more than one of these domains, rather than only a single domain as is frequently assumed for modeling. We present a framework for examining the effects of exposure and other covariates when the outcomes may each belong to more than one domain and where we also want to learn about the assignment of outcomes to domains. Each domain is defined by a sentinel outcome, which is preassigned to that domain only. All other outcomes can belong to multiple domains and are not preassigned. Our model allows exposure and covariate effects to differ across domains and across outcomes within domains, and includes random subject-specific effects that model correlations between outcomes within and across domains. We take a Bayesian MCMC approach. Results from the Seychelles study and from extensive simulations show that our model can effectively determine sparse domain assignment, and at the same time give increased power to detect overall, domain-specific, and outcome-specific exposure and covariate effects relative to separate models for each endpoint. When fit to the Seychelles data, several outcomes were classified as partly belonging to domains other than their originally assigned domains. In retrospect, the new partial domain assignments are reasonable and, as we discuss, suggest important scientific insights about the nature of the outcomes. Checks of model misspecification were improved relative to a model that assumes each outcome is in a single domain. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1-10 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.830070 File-URL: http://hdl.handle.net/10.1080/01621459.2013.830070 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:1-10 Template-Type: ReDIF-Article 1.0 Author-Name: Hui Huang Author-X-Name-First: Hui Author-X-Name-Last: Huang Author-Name: Xiaomei Ma Author-X-Name-First: Xiaomei Author-X-Name-Last: Ma Author-Name: Rasmus Waagepetersen Author-X-Name-First: Rasmus Author-X-Name-Last: Waagepetersen Author-Name: Theodore R. Holford Author-X-Name-First: Theodore R. 
Author-X-Name-Last: Holford Author-Name: Rong Wang Author-X-Name-First: Rong Author-X-Name-Last: Wang Author-Name: Harvey Risch Author-X-Name-First: Harvey Author-X-Name-Last: Risch Author-Name: Lloyd Mueller Author-X-Name-First: Lloyd Author-X-Name-Last: Mueller Author-Name: Yongtao Guan Author-X-Name-First: Yongtao Author-X-Name-Last: Guan Title: A New Estimation Approach for Combining Epidemiological Data From Multiple Sources Abstract: We propose a novel two-step procedure to combine epidemiological data obtained from diverse sources with the aim of quantifying risk factors affecting the probability that an individual develops a certain disease, such as cancer. In the first step, we derive all possible unbiased estimating functions based on a group of cases and a group of controls each time. In the second step, we combine these estimating functions efficiently to make full use of the information contained in the data. Our approach is computationally simple and flexible. We illustrate its efficacy through simulation and apply it to investigate pancreatic cancer risks based on data obtained from the Connecticut Tumor Registry, a population-based case--control study, and the Behavioral Risk Factor Surveillance System, which is a state-based system of health surveys. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 11-23 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.870904 File-URL: http://hdl.handle.net/10.1080/01621459.2013.870904 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:11-23 Template-Type: ReDIF-Article 1.0 Author-Name: Marco Carone Author-X-Name-First: Marco Author-X-Name-Last: Carone Author-Name: Masoud Asgharian Author-X-Name-First: Masoud Author-X-Name-Last: Asgharian Author-Name: Nicholas P. Jewell Author-X-Name-First: Nicholas P. Author-X-Name-Last: Jewell Title: Estimating the Lifetime Risk of Dementia in the Canadian Elderly Population Using Cross-Sectional Cohort Survival Data Abstract: Dementia is one of the world's major public health challenges. The lifetime risk of dementia is the proportion of individuals who ever develop dementia during their lifetime. Despite its importance to epidemiologists and policy-makers, this measure does not seem to have been estimated in the Canadian population. Data from a birth cohort study of dementia are not available. Instead, we must rely on data from the Canadian Study of Health and Aging, a large cross-sectional study of dementia with follow-up for survival. These data present challenges because they include substantial loss to follow-up and are not representatively drawn from the target population because of structural sampling biases. A first bias is imparted by the cross-sectional sampling scheme, while a second bias is a result of stratified sampling. Estimation of the lifetime risk and related quantities in the presence of these biases has not been previously addressed in the literature. We develop and study nonparametric estimators of the lifetime risk, the remaining lifetime risk, and cumulative risk at specific ages, accounting for these complexities. In particular, we show that estimation of the lifetime risk is invariant to stratification by current age at sampling.
We present simulation results validating our methodology, and provide novel facts about the epidemiology of dementia in Canada using data from the Canadian Study of Health and Aging. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 24-35 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.859076 File-URL: http://hdl.handle.net/10.1080/01621459.2013.859076 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:24-35 Template-Type: ReDIF-Article 1.0 Author-Name: Patrick M. Joyce Author-X-Name-First: Patrick M. Author-X-Name-Last: Joyce Author-Name: Donald Malec Author-X-Name-First: Donald Author-X-Name-Last: Malec Author-Name: Roderick J. A. Little Author-X-Name-First: Roderick J. A. Author-X-Name-Last: Little Author-Name: Aaron Gilary Author-X-Name-First: Aaron Author-X-Name-Last: Gilary Author-Name: Alfredo Navarro Author-X-Name-First: Alfredo Author-X-Name-Last: Navarro Author-Name: Mark E. Asiala Author-X-Name-First: Mark E. Author-X-Name-Last: Asiala Title: Statistical Modeling Methodology for the Voting Rights Act Section 203 Language Assistance Determinations Abstract: Section 203 of the Voting Rights Act includes provisions requiring the use of election materials in languages other than English for states or political subdivisions, specifically, when a minimum number of voting age U.S. citizens of specified language minority groups who are unable to speak English very well and have obtained less than a fifth-grade education is met. Data on these characteristics are provided by the 2010 Census and the American Community Survey (ACS), a general purpose sample survey designed to produce a large volume of estimates across the spectrum of the nation's geographic areas and subgroups of the population. This article describes the small-area model and the estimation methods that were developed and applied to create the list of 2011 political subdivisions that were subject to the provisions. Journal: Journal of the American Statistical Association Pages: 36-47 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.859077 File-URL: http://hdl.handle.net/10.1080/01621459.2013.859077 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:36-47 Template-Type: ReDIF-Article 1.0 Author-Name: Liqun Xi Author-X-Name-First: Liqun Author-X-Name-Last: Xi Author-Name: Kristin Brogaard Author-X-Name-First: Kristin Author-X-Name-Last: Brogaard Author-Name: Qingyang Zhang Author-X-Name-First: Qingyang Author-X-Name-Last: Zhang Author-Name: Bruce Lindsay Author-X-Name-First: Bruce Author-X-Name-Last: Lindsay Author-Name: Jonathan Widom Author-X-Name-First: Jonathan Author-X-Name-Last: Widom Author-Name: Ji-Ping Wang Author-X-Name-First: Ji-Ping Author-X-Name-Last: Wang Title: A Locally Convoluted Cluster Model for Nucleosome Positioning Signals in Chemical Maps Abstract: The nucleosome is the fundamental packing unit of DNA in eukaryotic cells, and its positioning plays a critical role in regulation of gene expression and chromosome functions. Using a recently developed chemical mapping method, nucleosomes can be potentially mapped with an unprecedented single-base-pair resolution. Existence of overlapping nucleosomes due to cell mixture or cell dynamics, however, causes convolution of nucleosome positioning signals. 
In this article, we introduce a locally convoluted cluster model and a maximum likelihood deconvolution approach, and illustrate the effectiveness of this approach in quantification of the nucleosome positional signal in the chemical mapping data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 48-62 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.862169 File-URL: http://hdl.handle.net/10.1080/01621459.2013.862169 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:48-62 Template-Type: ReDIF-Article 1.0 Author-Name: Lucas Janson Author-X-Name-First: Lucas Author-X-Name-Last: Janson Author-Name: Bala Rajaratnam Author-X-Name-First: Bala Author-X-Name-Last: Rajaratnam Title: A Methodology for Robust Multiproxy Paleoclimate Reconstructions and Modeling of Temperature Conditional Quantiles Abstract: Great strides have been made in the field of reconstructing past temperatures based on models relating temperature to temperature-sensitive paleoclimate proxies. One of the goals of such reconstructions is to assess whether current climate is anomalous in a millennial context. These regression-based approaches model the conditional mean of the temperature distribution as a function of paleoclimate proxies (or vice versa). Some recent work in the area has considered methods that help reduce the uncertainty inherent in such statistical paleoclimate reconstructions, with the ultimate goal of improving the confidence that can be attached to such endeavors. A second important scientific focus in the subject area is forward models for proxies, the goal of which is to understand the way paleoclimate proxies are driven by temperature and other environmental variables. One of the primary contributions of this article is novel statistical methodology for (i) quantile regression (QR) with autoregressive residual structure, (ii) estimation of corresponding model parameters, (iii) development of a rigorous framework for specifying uncertainty estimates of quantities of interest, yielding (iv) statistical byproducts that address the two scientific foci discussed above. We show that by using the above statistical methodology, we can demonstrably produce a more robust reconstruction than is possible by using conditional-mean-fitting methods. Our reconstruction shares some of the common features of past reconstructions, but we also gain useful insights. More importantly, we are able to demonstrate a significantly smaller uncertainty than that from previous regression methods. In addition, the QR component allows us to model, in a more complete and flexible way than least squares, the conditional distribution of temperature given proxies. This relationship can be used to inform forward models relating how proxies are driven by temperature. Journal: Journal of the American Statistical Association Pages: 63-77 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.848807 File-URL: http://hdl.handle.net/10.1080/01621459.2013.848807 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:63-77 Template-Type: ReDIF-Article 1.0 Author-Name: Naim Rashid Author-X-Name-First: Naim Author-X-Name-Last: Rashid Author-Name: Wei Sun Author-X-Name-First: Wei Author-X-Name-Last: Sun Author-Name: Joseph G.
Ibrahim Author-X-Name-First: Joseph G. Author-X-Name-Last: Ibrahim Title: Some Statistical Strategies for DAE-seq Data Analysis: Variable Selection and Modeling Dependencies Among Observations Abstract: In DAE (DNA after enrichment)-seq experiments, genomic regions related to certain biological processes are enriched/isolated by an assay and are then sequenced on a high-throughput sequencing platform to determine their genomic positions. Statistical analysis of DAE-seq data aims to detect genomic regions with significant aggregations of isolated DNA fragments ("enriched regions") versus all the other regions ("background"). However, many confounding factors may influence DAE-seq signals. In addition, the signals in adjacent genomic regions may exhibit strong correlations, which invalidate the independence assumption employed by many existing methods. To mitigate these issues, we develop a novel autoregressive Hidden Markov model (AR-HMM) to account for covariate effects and violations of the independence assumption. We demonstrate that our AR-HMM leads to improved performance in identifying enriched regions in both simulated and real datasets, especially in epigenetic datasets with broader regions of DAE-seq signal enrichment. We also introduce a variable selection procedure in the context of the HMM/AR-HMM where the observations are not independent and the mean value of each state-specific emission distribution is modeled by some covariates. We study the theoretical properties of this variable selection procedure and demonstrate its efficacy in simulated and real DAE-seq data. In summary, we develop several practical approaches for DAE-seq data analysis that are also applicable to more general problems in statistics. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 78-94 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.869222 File-URL: http://hdl.handle.net/10.1080/01621459.2013.869222 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:78-94 Template-Type: ReDIF-Article 1.0 Author-Name: Corwin Matthew Zigler Author-X-Name-First: Corwin Matthew Author-X-Name-Last: Zigler Author-Name: Francesca Dominici Author-X-Name-First: Francesca Author-X-Name-Last: Dominici Title: Uncertainty in Propensity Score Estimation: Bayesian Methods for Variable Selection and Model-Averaged Causal Effects Abstract: Causal inference with observational data frequently relies on the notion of the propensity score (PS) to adjust treatment comparisons for observed confounding factors. As decisions in the era of "big data" are increasingly reliant on large and complex collections of digital data, researchers are frequently confronted with decisions regarding which variables in a high-dimensional covariate set to include in the PS model to satisfy the assumptions necessary for estimating average causal effects. Typically, simple or ad hoc methods are employed to arrive at a single PS model, without acknowledging the uncertainty associated with the model selection. We propose three Bayesian methods for PS variable selection and model averaging that (a) select relevant variables from a set of candidate variables to include in the PS model and (b) estimate causal treatment effects as weighted averages of estimates under different PS models.
The associated weight for each PS model reflects the data-driven support for that model's ability to adjust for the necessary variables. We illustrate features of our proposed approaches with a simulation study, and ultimately use our methods to compare the effectiveness of surgical versus nonsurgical treatment for brain tumors among 2606 Medicare beneficiaries. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 95-107 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.869498 File-URL: http://hdl.handle.net/10.1080/01621459.2013.869498 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:95-107 Template-Type: ReDIF-Article 1.0 Author-Name: Ziyue Liu Author-X-Name-First: Ziyue Author-X-Name-Last: Liu Author-Name: Anne R. Cappola Author-X-Name-First: Anne R. Author-X-Name-Last: Cappola Author-Name: Leslie J. Crofford Author-X-Name-First: Leslie J. Author-X-Name-Last: Crofford Author-Name: Wensheng Guo Author-X-Name-First: Wensheng Author-X-Name-Last: Guo Title: Modeling Bivariate Longitudinal Hormone Profiles by Hierarchical State Space Models Abstract: The hypothalamic-pituitary-adrenal (HPA) axis is crucial in coping with stress and maintaining homeostasis. Hormones produced by the HPA axis exhibit both complex univariate longitudinal profiles and complex relationships among different hormones. Consequently, modeling these multivariate longitudinal hormone profiles is a challenging task. In this article, we propose a bivariate hierarchical state space model, in which each hormone profile is modeled by a hierarchical state space model, with both population-average and subject-specific components. The bivariate model is constructed by concatenating the univariate models based on the hypothesized relationship. Because of the flexible framework of the state space form, the resultant models can not only handle complex individual profiles, but also incorporate complex relationships between two hormones, including both concurrent and feedback relationships. Estimation and inference are based on marginal likelihood and posterior means and variances. Computationally efficient Kalman filtering and smoothing algorithms are used for implementation. Application of the proposed method to a study of chronic fatigue syndrome and fibromyalgia reveals that the relationships between adrenocorticotropic hormone and cortisol in the patient group are weaker than in healthy controls. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 108-118 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.830071 File-URL: http://hdl.handle.net/10.1080/01621459.2013.830071 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:108-118 Template-Type: ReDIF-Article 1.0 Author-Name: Yinshan Zhao Author-X-Name-First: Yinshan Author-X-Name-Last: Zhao Author-Name: David K. B. Li Author-X-Name-First: David K. B. Author-X-Name-Last: Li Author-Name: A. John Petkau Author-X-Name-First: A.
John Author-X-Name-Last: Petkau Author-Name: Andrew Riddehough Author-X-Name-First: Andrew Author-X-Name-Last: Riddehough Author-Name: Anthony Traboulsee Author-X-Name-First: Anthony Author-X-Name-Last: Traboulsee Title: Detection of Unusual Increases in MRI Lesion Counts in Individual Multiple Sclerosis Patients Abstract: Data Safety and Monitoring Boards (DSMBs) for multiple sclerosis clinical trials consider an increase of contrast-enhancing lesions on repeated magnetic resonance imaging an indicator for potential adverse events. However, there are no published studies that clearly identify what should be considered an "unexpected increase" of lesion activity for a patient. To address this problem, we consider as an index the likelihood of observing lesion counts as large as those observed on the recent scans of a patient conditional on the patient's lesion counts on previous scans. To estimate this index, we rely on random effects models. Given the patient-specific random effect, we assume that the repeated lesion counts from the same patient follow a negative binomial distribution and may be correlated over time. We fit the model using data collected from the trial under DSMB review and update the estimation when new data are to be reviewed. We consider two estimation procedures: maximum likelihood for a fully parameterized model and a simple semiparametric method for a model with an unspecified distribution for the random effects. We examine the performance of our methods using simulations and illustrate the approach using data from a clinical trial. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 119-132 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.847373 File-URL: http://hdl.handle.net/10.1080/01621459.2013.847373 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:119-132 Template-Type: ReDIF-Article 1.0 Author-Name: Ben B. Hansen Author-X-Name-First: Ben B. Author-X-Name-Last: Hansen Author-Name: Paul R. Rosenbaum Author-X-Name-First: Paul R. Author-X-Name-Last: Rosenbaum Author-Name: Dylan S. Small Author-X-Name-First: Dylan S. Author-X-Name-Last: Small Title: Clustered Treatment Assignments and Sensitivity to Unmeasured Biases in Observational Studies Abstract: Clustered treatment assignment occurs when individuals are grouped into clusters prior to treatment and whole clusters, not individuals, are assigned to treatment or control. In randomized trials, clustered assignments may be required because the treatment must be applied to all children in a classroom, or to all patients at a clinic, or to all radio listeners in the same media market. The most common cluster randomized design pairs 2S clusters into S pairs based on similar pretreatment covariates, then picks one cluster in each pair at random for treatment, the other cluster being assigned to control. Typically, group randomization increases sampling variability and so is less efficient, less powerful, than randomization at the individual level, but it may be unavoidable when it is impractical to treat just a few people within each cluster. Related issues arise in nonrandomized, observational studies of treatment effects, but in this case one must examine the sensitivity of conclusions to bias from nonrandom selection of clusters for treatment. 
Although clustered assignment increases sampling variability in observational studies, as it does in randomized experiments, it also tends to decrease sensitivity to unmeasured biases, and as the number of cluster pairs increases the latter effect overtakes the former, dominating it when allowance is made for nontrivial biases in treatment assignment. Intuitively, a given magnitude of departure from random assignment can do more harm if it acts on individual students than if it is restricted to act on whole classes, because the bias is unable to pick the strongest individual students for treatment, and this is especially true if a serious effort is made to pair clusters that appeared similar prior to treatment. We examine this issue using an asymptotic measure, the design sensitivity, some inequalities that exploit convexity, simulation, and an application concerned with the flooding of villages in Bangladesh. Journal: Journal of the American Statistical Association Pages: 133-144 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.863157 File-URL: http://hdl.handle.net/10.1080/01621459.2013.863157 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:133-144 Template-Type: ReDIF-Article 1.0 Author-Name: Genevera I. Allen Author-X-Name-First: Genevera I. Author-X-Name-Last: Allen Author-Name: Logan Grosenick Author-X-Name-First: Logan Author-X-Name-Last: Grosenick Author-Name: Jonathan Taylor Author-X-Name-First: Jonathan Author-X-Name-Last: Taylor Title: A Generalized Least-Square Matrix Decomposition Abstract: Variables in many big-data settings are structured, arising, for example, from measurements on a regular grid as in imaging and time series or from spatial-temporal measurements as in climate studies. Classical multivariate techniques ignore these structural relationships often resulting in poor performance. We propose a generalization of principal components analysis (PCA) that is appropriate for massive datasets with structured variables or known two-way dependencies. By finding the best low-rank approximation of the data with respect to a transposable quadratic norm, our decomposition, entitled the generalized least-square matrix decomposition (GMD), directly accounts for structural relationships. As many variables in high-dimensional settings are often irrelevant, we also regularize our matrix decomposition by adding two-way penalties to encourage sparsity or smoothness. We develop fast computational algorithms using our methods to perform generalized PCA (GPCA), sparse GPCA, and functional GPCA on massive datasets. Through simulations and a whole brain functional MRI example, we demonstrate the utility of our methodology for dimension reduction, signal recovery, and feature selection with high-dimensional structured data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 145-159 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.852978 File-URL: http://hdl.handle.net/10.1080/01621459.2013.852978 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:145-159 Template-Type: ReDIF-Article 1.0 Author-Name: François Portier Author-X-Name-First: François Author-X-Name-Last: Portier Author-Name: Bernard Delyon Author-X-Name-First: Bernard Author-X-Name-Last: Delyon Title: Bootstrap Testing of the Rank of a Matrix via Least-Squared Constrained Estimation Abstract: To test whether an unknown matrix M0 has a given rank (null hypothesis denoted H0), we consider a statistic that is a squared distance between an estimator and the submanifold of fixed-rank matrices. Under H0, this statistic converges to a weighted chi-squared distribution. We introduce the constrained bootstrap (CS bootstrap) to estimate the law of this statistic under H0. An important point is that even if H0 fails, the CS bootstrap reproduces the behavior of the statistic under H0. As a consequence, the CS bootstrap is employed to estimate the nonasymptotic quantile for testing the rank. We establish the consistency of the procedure, and the simulations shed light on the accuracy of the CS bootstrap with respect to the traditional asymptotic comparison. More generally, the results are extended to test whether an unknown parameter belongs to a submanifold of the Euclidean space. Finally, the CS bootstrap is easy to compute, handles a large family of tests, and works under mild assumptions. Journal: Journal of the American Statistical Association Pages: 160-172 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.847841 File-URL: http://hdl.handle.net/10.1080/01621459.2013.847841 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:160-172 Template-Type: ReDIF-Article 1.0 Author-Name: Nicolas J-B. Brunel Author-X-Name-First: Nicolas J-B. Author-X-Name-Last: Brunel Author-Name: Quentin Clairon Author-X-Name-First: Quentin Author-X-Name-Last: Clairon Author-Name: Florence d'Alché-Buc Author-X-Name-First: Florence Author-X-Name-Last: d'Alché-Buc Title: Parametric Estimation of Ordinary Differential Equations With Orthogonality Conditions Abstract: Differential equations are commonly used to model dynamical deterministic systems in applications. When statistical parameter estimation is required to calibrate theoretical models to data, classical statistical estimators are often confronted with complex and potentially ill-posed optimization problems. As a consequence, alternative estimators to classical parametric estimators are needed for obtaining reliable estimates. We propose a gradient matching approach for the estimation of parametric Ordinary Differential Equations (ODE) observed with noise. Starting from a nonparametric proxy of a true solution of the ODE, we build a parametric estimator based on a variational characterization of the solution. As a Generalized Moment Estimator, our estimator must satisfy a set of orthogonality conditions that are solved in the least squares sense. Despite the use of a nonparametric estimator, we prove the root-n consistency and asymptotic normality of the Orthogonality Conditions (OC) estimator. We can derive confidence sets thanks to a closed-form expression for the asymptotic variance. Finally, the OC estimator is compared to classical estimators in several (simulated and real) experiments and ODE models to show its versatility and relevance with respect to classical Gradient Matching and Nonlinear Least Squares estimators.
In particular, we show on a real dataset of influenza infection that the approach gives reliable estimates. Moreover, we show that our approach can deal directly with more elaborate models, such as Delay Differential Equations (DDEs). Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 173-185 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.841583 File-URL: http://hdl.handle.net/10.1080/01621459.2013.841583 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:173-185 Template-Type: ReDIF-Article 1.0 Author-Name: Alan Huang Author-X-Name-First: Alan Author-X-Name-Last: Huang Title: Joint Estimation of the Mean and Error Distribution in Generalized Linear Models Abstract: This article introduces a semiparametric extension of generalized linear models that is based on a full probability model, but does not require specification of an error distribution or variance function for the data. The approach involves treating the error distribution as an infinite-dimensional parameter, which is then estimated simultaneously with the mean-model parameters using a maximum empirical likelihood approach. The resulting estimators are shown to be consistent and jointly asymptotically normal in distribution. When interest lies only in inferences on the mean-model parameters, we show that maximizing out the error distribution leads to profile empirical log-likelihood ratio statistics that have asymptotic chi-squared distributions under the null. Simulation studies demonstrate that the proposed method can be more accurate than existing methods that offer the same level of flexibility and generality, especially with smaller sample sizes. The theoretical and numerical results are complemented by a data analysis example. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 186-196 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.824892 File-URL: http://hdl.handle.net/10.1080/01621459.2013.824892 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:186-196 Template-Type: ReDIF-Article 1.0 Author-Name: Christina D. Wang Author-X-Name-First: Christina D. Author-X-Name-Last: Wang Author-Name: Per A. Mykland Author-X-Name-First: Per A. Author-X-Name-Last: Mykland Title: The Estimation of Leverage Effect With High-Frequency Data Abstract: The leverage effect has become an extensively studied phenomenon that describes the (usually) negative relation between stock returns and their volatility. Although this characteristic of stock returns is well acknowledged, most studies of the phenomenon are based on cross-sectional calibration with parametric models. On the statistical side, most previous work has been conducted over daily or longer return horizons, and few studies have carefully examined its estimation, especially with high-frequency data. However, estimation of the leverage effect is important because sensible inference is possible only when the leverage effect is estimated reliably. In this article, we provide nonparametric estimation for a class of stochastic measures of leverage effect. To construct estimators with good statistical properties, we introduce a new stochastic leverage effect parameter.
The estimators and their statistical properties are provided in cases both with and without microstructure noise, under the stochastic volatility model. The consistency and limiting distributions of the estimators are derived asymptotically and corroborated by simulation results. To achieve consistency, a previously unknown bias correction factor is added to the estimators. Applications of the estimators are also explored. This estimator provides the opportunity to study high-frequency regression, which leads to the prediction of volatility using not only previous volatility but also the leverage effect. The estimator also reveals a theoretical connection between skewness and the leverage effect, which further leads to the prediction of skewness. Furthermore, by adopting ideas similar to those used in estimating the leverage effect, it is easy to extend the methods to study other important aspects of stock returns, such as the volatility of volatility. Journal: Journal of the American Statistical Association Pages: 197-215 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.864189 File-URL: http://hdl.handle.net/10.1080/01621459.2013.864189 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:197-215 Template-Type: ReDIF-Article 1.0 Author-Name: Eun Ryung Lee Author-X-Name-First: Eun Ryung Author-X-Name-Last: Lee Author-Name: Hohsuk Noh Author-X-Name-First: Hohsuk Author-X-Name-Last: Noh Author-Name: Byeong U. Park Author-X-Name-First: Byeong U. Author-X-Name-Last: Park Title: Model Selection via Bayesian Information Criterion for Quantile Regression Models Abstract: The Bayesian information criterion (BIC) is known to identify the true model consistently as long as the predictor dimension is finite. Recently, its moderate modifications have been shown to be consistent in model selection even when the number of variables diverges. Those works have been done mostly in mean regression, but rarely in quantile regression. The best-known results about BIC for quantile regression are for linear models with a fixed number of variables. In this article, we investigate how BIC can be adapted to high-dimensional linear quantile regression and show that a modified BIC is consistent in model selection when the number of variables diverges as the sample size increases. We also discuss how it can be used for choosing the regularization parameters of penalized approaches that are designed to conduct variable selection and shrinkage estimation simultaneously. Moreover, we extend the results to structured nonparametric quantile models with a diverging number of covariates. We illustrate our theoretical results via some simulated examples and a real data analysis on human eye disease. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 216-229 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.836975 File-URL: http://hdl.handle.net/10.1080/01621459.2013.836975 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:216-229 Template-Type: ReDIF-Article 1.0 Author-Name: Ruosha Li Author-X-Name-First: Ruosha Author-X-Name-Last: Li Author-Name: Yu Cheng Author-X-Name-First: Yu Author-X-Name-Last: Cheng Author-Name: Jason P. Fine Author-X-Name-First: Jason P.
Author-X-Name-Last: Fine Title: Quantile Association Regression Models Abstract: It is often important to study the association between two continuous variables. In this work, we propose a novel regression framework for assessing conditional associations on quantiles. We develop general methodology which permits covariate effects on both the marginal quantile models for the two variables and their quantile associations. The proposed quantile copula models have straightforward interpretation, facilitating a comprehensive view of association structure which is much richer than that based on standard product moment and rank correlations. We show that the resulting estimators are uniformly consistent and weakly convergent as a process of the quantile index. Simple variance estimators are presented which perform well in numerical studies. Extensive simulations and a real data example demonstrate the practical utility of the methodology. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 230-242 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.847375 File-URL: http://hdl.handle.net/10.1080/01621459.2013.847375 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:230-242 Template-Type: ReDIF-Article 1.0 Author-Name: Ching-Kang Ing Author-X-Name-First: Ching-Kang Author-X-Name-Last: Ing Author-Name: Chiao-Yi Yang Author-X-Name-First: Chiao-Yi Author-X-Name-Last: Yang Title: Predictor Selection for Positive Autoregressive Processes Abstract: Let observations y1, …, yn be generated from a first-order autoregressive (AR) model with positive errors. In both the stationary and unit root cases, we derive moment bounds and limiting distributions of an extreme value estimator of the AR coefficient. These results enable us to provide asymptotic expressions for the mean squared error (MSE) of this estimator and the mean squared prediction error (MSPE) of the corresponding predictor of yn+1. Based on these expressions, we compare the relative performance of the extreme value predictor (estimator) and the least-squares predictor (estimator) from the MSPE (MSE) point of view. Our comparison reveals that the better predictor (estimator) is determined not only by whether a unit root exists, but also by the behavior of the underlying error distribution near the origin, and hence is difficult to identify in practice. To circumvent this difficulty, we suggest choosing the predictor (estimator) with the smaller accumulated prediction error and show that the predictor (estimator) chosen in this way is asymptotically equivalent to the better one. Both real and simulated datasets are used to illustrate the proposed method. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 243-253 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.836974 File-URL: http://hdl.handle.net/10.1080/01621459.2013.836974 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:243-253 Template-Type: ReDIF-Article 1.0 Author-Name: Tomohiro Ando Author-X-Name-First: Tomohiro Author-X-Name-Last: Ando Author-Name: Ker-Chau Li Author-X-Name-First: Ker-Chau Author-X-Name-Last: Li Title: A Model-Averaging Approach for High-Dimensional Regression Abstract: This article considers high-dimensional regression problems in which the number of predictors p exceeds the sample size n. We develop a model-averaging procedure for high-dimensional regression problems. Unlike most variable selection studies featuring the identification of true predictors, our focus here is on the prediction accuracy for the true conditional mean of y given the p predictors. Our method consists of two steps. The first step is to construct a class of regression models, each with a smaller number of regressors, to avoid the degeneracy of the information matrix. The second step is to find suitable model weights for averaging. To minimize the prediction error, we estimate the model weights using a delete-one cross-validation procedure. Departing from the model-averaging literature, which requires that the weights always sum to one, an important improvement we introduce is to remove this constraint. We derive some theoretical results to justify our procedure. A theorem is proved, showing that delete-one cross-validation achieves the lowest possible prediction loss asymptotically. This optimality result requires a condition that unravels an important feature of high-dimensional regression. The prediction error of any individual model in the class for averaging is required to be higher than the classic root-n rate under traditional parametric regression. This condition reflects the difficulty of high-dimensional regression and it depicts a situation especially meaningful for p > n. We also conduct a simulation study to illustrate the merits of the proposed approach over several existing methods, including lasso, group lasso, forward regression, Phase Coupled (PC)-simple algorithm, Akaike information criterion (AIC) model-averaging, Bayesian information criterion (BIC) model-averaging methods, and SCAD (smoothly clipped absolute deviation). This approach uses quadratic programming to overcome the computing time issue commonly encountered in the cross-validation literature. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 254-265 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.838168 File-URL: http://hdl.handle.net/10.1080/01621459.2013.838168 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:254-265 Template-Type: ReDIF-Article 1.0 Author-Name: Jingyuan Liu Author-X-Name-First: Jingyuan Author-X-Name-Last: Liu Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Author-Name: Rongling Wu Author-X-Name-First: Rongling Author-X-Name-Last: Wu Title: Feature Selection for Varying Coefficient Models With Ultrahigh-Dimensional Covariates Abstract: This article is concerned with feature screening and variable selection for varying coefficient models with ultrahigh-dimensional covariates. We propose a new feature screening procedure for these models based on the conditional correlation coefficient. We systematically study the theoretical properties of the proposed procedure and establish its sure screening property and ranking consistency.
To enhance the finite sample performance of the proposed procedure, we further develop an iterative feature screening procedure. Monte Carlo simulation studies are conducted to examine the performance of the proposed procedures. In practice, we advocate a two-stage approach for varying coefficient models. The two-stage approach consists of (a) reducing the ultrahigh dimensionality by using the proposed procedure and (b) applying regularization methods for dimension-reduced varying coefficient models to make statistical inferences on the coefficient functions. We illustrate the proposed two-stage approach by a real data example. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 266-274 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.850086 File-URL: http://hdl.handle.net/10.1080/01621459.2013.850086 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:266-274 Template-Type: ReDIF-Article 1.0 Author-Name: Fang Han Author-X-Name-First: Fang Author-X-Name-Last: Han Author-Name: Han Liu Author-X-Name-First: Han Author-X-Name-Last: Liu Title: Scale-Invariant Sparse PCA on High-Dimensional Meta-Elliptical Data Abstract: We propose a semiparametric method for conducting scale-invariant sparse principal component analysis (PCA) on high-dimensional non-Gaussian data. Compared with sparse PCA, our method has a weaker modeling assumption and is more robust to possible data contamination. Theoretically, the proposed method achieves a parametric rate of convergence in estimating the parameters of interest under a flexible semiparametric distribution family; computationally, the proposed method exploits a rank-based procedure and is as efficient as sparse PCA; empirically, our method outperforms most competing methods on both synthetic and real-world datasets. Journal: Journal of the American Statistical Association Pages: 275-287 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.844699 File-URL: http://hdl.handle.net/10.1080/01621459.2013.844699 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:275-287 Template-Type: ReDIF-Article 1.0 Author-Name: Lan Liu Author-X-Name-First: Lan Author-X-Name-Last: Liu Author-Name: Michael G. Hudgens Author-X-Name-First: Michael G. Author-X-Name-Last: Hudgens Title: Large Sample Randomization Inference of Causal Effects in the Presence of Interference Abstract: Recently, there has been increasing interest in making causal inference when interference is possible. In the presence of interference, treatment may have several types of effects. In this article, we consider inference about such effects when the population consists of groups of individuals where interference is possible within groups but not between groups. A two-stage randomization design is assumed where in the first stage groups are randomized to different treatment allocation strategies and in the second stage individuals are randomized to treatment or control conditional on the strategy assigned to their group in the first stage. For this design, the asymptotic distributions of estimators of the causal effects are derived when either the number of individuals per group or the number of groups grows large.
Under certain homogeneity assumptions, the asymptotic distributions provide justification for Wald-type confidence intervals (CIs) and tests. Empirical results demonstrate that the Wald CIs have good coverage in finite samples and are narrower than CIs based on either the Chebyshev or Hoeffding inequalities, provided the number of groups is not too small. The methods are illustrated by two examples that consider the effects of cholera vaccination and an intervention to encourage voting. Journal: Journal of the American Statistical Association Pages: 288-301 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.844698 File-URL: http://hdl.handle.net/10.1080/01621459.2013.844698 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:288-301 Template-Type: ReDIF-Article 1.0 Author-Name: A. C. Davison Author-X-Name-First: A. C. Author-X-Name-Last: Davison Author-Name: D. A. S. Fraser Author-X-Name-First: D. A. S. Author-X-Name-Last: Fraser Author-Name: N. Reid Author-X-Name-First: N. Author-X-Name-Last: Reid Author-Name: N. Sartori Author-X-Name-First: N. Author-X-Name-Last: Sartori Title: Accurate Directional Inference for Vector Parameters in Linear Exponential Families Abstract: We consider inference on a vector-valued parameter of interest in a linear exponential family, in the presence of a finite-dimensional nuisance parameter. Based on higher-order asymptotic theory for likelihood, we propose a directional test whose p-value is computed using one-dimensional integration. The work simplifies and develops earlier research on directional tests for continuous models and on higher-order inference for discrete models, and the examples include contingency tables and logistic regression. Examples and simulations illustrate the high accuracy of the method, which we compare with the usual likelihood ratio test and with an adjusted version due to Skovgaard. In high-dimensional settings, such as covariance selection, the approach works essentially perfectly, whereas its competitors can fail catastrophically. Journal: Journal of the American Statistical Association Pages: 302-314 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.839451 File-URL: http://hdl.handle.net/10.1080/01621459.2013.839451 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:302-314 Template-Type: ReDIF-Article 1.0 Author-Name: Simon Barthelmé Author-X-Name-First: Simon Author-X-Name-Last: Barthelmé Author-Name: Nicolas Chopin Author-X-Name-First: Nicolas Author-X-Name-Last: Chopin Title: Expectation Propagation for Likelihood-Free Inference Abstract: Many models of interest in the natural and social sciences have no closed-form likelihood function, which means that they cannot be treated using the usual techniques of statistical inference. In the case where such models can be efficiently simulated, Bayesian inference is still possible thanks to the approximate Bayesian computation (ABC) algorithm. Although many refinements have been suggested, ABC inference is still far from routine. ABC is often excruciatingly slow due to very low acceptance rates. In addition, ABC requires introducing a vector of "summary statistics" s(y), the choice of which is relatively arbitrary and often requires some trial and error, making the whole process laborious for the user.
We introduce in this work the EP-ABC algorithm, which is an adaptation to the likelihood-free context of the variational approximation algorithm known as expectation propagation. The main advantage of EP-ABC is that it is faster by a few orders of magnitude than standard algorithms, while producing an overall approximation error that is typically negligible. A second advantage of EP-ABC is that it replaces the usual global ABC constraint ‖s(y) - s(y⋆)‖ ⩽ ϵ, where s(y⋆) is the vector of summary statistics computed on the whole dataset, by n local constraints of the form ‖si(yi) - si(yi⋆)‖ ⩽ ϵ that apply separately to each data point. In particular, it is often possible to take si(yi) = yi, making it possible to do away with summary statistics entirely. In that case, EP-ABC makes it possible to approximate directly the evidence (marginal likelihood) of the model. Comparisons are performed in three real-world applications that are typical of likelihood-free inference, including one application in neuroscience that is novel, and possibly too challenging for standard ABC techniques. Journal: Journal of the American Statistical Association Pages: 315-333 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.864178 File-URL: http://hdl.handle.net/10.1080/01621459.2013.864178 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:315-333 Template-Type: ReDIF-Article 1.0 Author-Name: David S. Matteson Author-X-Name-First: David S. Author-X-Name-Last: Matteson Author-Name: Nicholas A. James Author-X-Name-First: Nicholas A. Author-X-Name-Last: James Title: A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data Abstract: Change point analysis has applications in a wide variety of fields. The general problem concerns the inference of a change in distribution for a set of time-ordered observations. Sequential detection is an online version in which new data are continually arriving and are analyzed adaptively. We are concerned with the related, but distinct, offline version, in which retrospective analysis of an entire sequence is performed. For a set of multivariate observations of arbitrary dimension, we consider nonparametric estimation of both the number of change points and the positions at which they occur. We do not make any assumptions regarding the nature of the change in distribution or any distribution assumptions beyond the existence of the αth absolute moment, for some α ∈ (0, 2). Estimation is based on hierarchical clustering and we propose both divisive and agglomerative algorithms. The divisive method is shown to provide consistent estimates of both the number and the location of change points under standard regularity assumptions. We compare the proposed approach with competing methods in a simulation study. Methods from cluster analysis are applied to assess performance and to allow simple comparisons of location estimates, even when the estimated number differs. We conclude with applications in genetics, finance, and spatio-temporal analysis. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 334-345 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.849605 File-URL: http://hdl.handle.net/10.1080/01621459.2013.849605 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:334-345 Template-Type: ReDIF-Article 1.0 Author-Name: Gery Geenens Author-X-Name-First: Gery Author-X-Name-Last: Geenens Title: Probit Transformation for Kernel Density Estimation on the Unit Interval Abstract: Kernel estimation of a probability density function supported on the unit interval has proved difficult, because of the well-known boundary bias issues a conventional kernel density estimator would necessarily face in this situation. Transforming the variable of interest into a variable whose density has unconstrained support, estimating that density, and obtaining an estimate of the density of the original variable through back-transformation, seems a natural idea to easily get rid of the boundary problems. In practice, however, a simple and efficient implementation of this methodology is far from immediate, and the few attempts found in the literature have been reported not to perform well. In this article, the main reasons for this failure are identified and an easy way to correct them is suggested. It turns out that combining the transformation idea with local likelihood density estimation produces viable density estimators, mostly free from boundary issues. Their asymptotic properties are derived, and a practical cross-validation bandwidth selection rule is devised. Extensive simulations demonstrate the excellent performance of these estimators compared to their main competitors for a wide range of density shapes. In fact, they turn out to be the best choice overall. Finally, they are used to successfully estimate a density of nonstandard shape supported on [0, 1] from a small-size real data sample. Journal: Journal of the American Statistical Association Pages: 346-358 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.842173 File-URL: http://hdl.handle.net/10.1080/01621459.2013.842173 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:346-358 Template-Type: ReDIF-Article 1.0 Author-Name: Kenji Fukumizu Author-X-Name-First: Kenji Author-X-Name-Last: Fukumizu Author-Name: Chenlei Leng Author-X-Name-First: Chenlei Author-X-Name-Last: Leng Title: Gradient-Based Kernel Dimension Reduction for Regression Abstract: This article proposes a novel approach to linear dimension reduction for regression using nonparametric estimation with positive-definite kernels or reproducing kernel Hilbert spaces (RKHSs). The purpose of the dimension reduction is to find directions in the explanatory variables that explain the response sufficiently: this is called sufficient dimension reduction. The proposed method is based on an estimator for the gradient of the regression function considered for the feature vectors mapped into RKHSs. It is proved that the method is able to estimate the directions that achieve sufficient dimension reduction. In comparison with other existing methods, the proposed one has wide applicability without strong assumptions on the distributions or the type of variables, and needs only eigendecomposition for estimating the projection matrix. The theoretical analysis shows that the estimator is consistent at a certain rate under some conditions. The experimental results demonstrate that the proposed method successfully finds effective directions with efficient computation even for high-dimensional explanatory variables.
Journal: Journal of the American Statistical Association Pages: 359-370 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.838167 File-URL: http://hdl.handle.net/10.1080/01621459.2013.838167 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:359-370 Template-Type: ReDIF-Article 1.0 Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: D. Y. Lin Author-X-Name-First: D. Y. Author-X-Name-Last: Lin Title: Efficient Estimation of Semiparametric Transformation Models for Two-Phase Cohort Studies Abstract: Under two-phase cohort designs, such as case--cohort and nested case--control sampling, information on observed event times, event indicators, and inexpensive covariates is collected in the first phase, and the first-phase information is used to select subjects for measurements of expensive covariates in the second phase; inexpensive covariates are also used in the data analysis to control for confounding and to evaluate interactions. This article provides efficient estimation of semiparametric transformation models for such designs, accommodating both discrete and continuous covariates, and allowing inexpensive and expensive covariates to be correlated. The estimation is based on the maximization of a modified nonparametric likelihood function through a generalization of the expectation--maximization algorithm. The resulting estimators are shown to be consistent, asymptotically normal, and asymptotically efficient with easily estimated variances. Simulation studies demonstrate that the asymptotic approximations are accurate in practical situations. Empirical data from Wilms' tumor studies and the Atherosclerosis Risk in Communities (ARIC) study are presented. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 371-383 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.842172 File-URL: http://hdl.handle.net/10.1080/01621459.2013.842172 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:371-383 Template-Type: ReDIF-Article 1.0 Author-Name: Layla Parast Author-X-Name-First: Layla Author-X-Name-Last: Parast Author-Name: Lu Tian Author-X-Name-First: Lu Author-X-Name-Last: Tian Author-Name: Tianxi Cai Author-X-Name-First: Tianxi Author-X-Name-Last: Cai Title: Landmark Estimation of Survival and Treatment Effect in a Randomized Clinical Trial Abstract: In many studies with a survival outcome, it is often not feasible to fully observe the primary event of interest. This often leads to heavy censoring and thus difficulty in efficiently estimating survival or comparing survival rates between two groups. In certain diseases, baseline covariates and the event time of nonfatal intermediate events may be associated with overall survival. In these settings, incorporating such additional information may lead to gains in efficiency in estimation of survival and testing for a difference in survival between two treatment groups. If gains in efficiency can be achieved, it may then be possible to decrease the number of patients required for a study to achieve a particular power level or decrease the duration of the study.
Most existing methods for incorporating intermediate events and covariates to predict survival focus on estimation of relative risk parameters and/or the joint distribution of events under semiparametric models. However, in practice, these model assumptions may not hold and hence may lead to biased estimates of the marginal survival. In this article, we propose a seminonparametric two-stage procedure to estimate and compare t-year survival rates by incorporating intermediate event information observed before some landmark time, which serves as a useful approach to overcome semicompeting risk issues. In a randomized clinical trial setting, we further improve efficiency through an additional calibration step. Simulation studies demonstrate substantial potential gains in efficiency in terms of estimation and power. We illustrate our proposed procedures using an AIDS Clinical Trial Protocol 175 dataset by estimating survival and examining the difference in survival between two treatment groups: zidovudine and zidovudine plus zalcitabine. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 384-394 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.842488 File-URL: http://hdl.handle.net/10.1080/01621459.2013.842488 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:384-394 Template-Type: ReDIF-Article 1.0 Author-Name: Bruce G. Lindsay Author-X-Name-First: Bruce G. Author-X-Name-Last: Lindsay Author-Name: Marianthi Markatou Author-X-Name-First: Marianthi Author-X-Name-Last: Markatou Author-Name: Surajit Ray Author-X-Name-First: Surajit Author-X-Name-Last: Ray Title: Kernels, Degrees of Freedom, and Power Properties of Quadratic Distance Goodness-of-Fit Tests Abstract: In this article, we study the power properties of quadratic-distance-based goodness-of-fit tests. First, we introduce the concept of a root kernel and discuss the considerations that enter the selection of this kernel. We derive an easy-to-use normal approximation to the power of quadratic distance goodness-of-fit tests and base the construction of a noncentrality index, an analogue of the traditional noncentrality parameter, on it. This leads to a method akin to the Neyman-Pearson lemma for constructing optimal kernels for specific alternatives. We then introduce a midpower analysis as a device for choosing optimal degrees of freedom for a family of alternatives of interest. Finally, we introduce a new diffusion kernel, called the Pearson-normal kernel, and study the extent to which the normal approximation to the power of tests based on this kernel is valid. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 395-410 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.836972 File-URL: http://hdl.handle.net/10.1080/01621459.2013.836972 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
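As a small illustration of the objects in the Lindsay, Markatou, and Ray abstract above, the sketch below computes the V-statistic form of a kernel-based quadratic distance between a sample and a hypothesized model G, centering the kernel under G by Monte Carlo. The kernel, the model sampler, and the Monte Carlo size are assumptions for illustration, not choices made in the article.

    import numpy as np

    def quadratic_distance_stat(x, model_sampler, kernel, n_mc=2000, rng=None):
        # V-statistic (1/n^2) sum_ij K_cen(x_i, x_j), with the kernel
        # centered under the model G via Monte Carlo draws.
        rng = rng or np.random.default_rng(0)
        x = np.asarray(x, dtype=float)
        y = model_sampler(n_mc, rng)                         # draws from G
        row = kernel(x[:, None], y[None, :]).mean(axis=1)    # E_G K(x_i, Y)
        grand = kernel(y[:, None], y[None, :]).mean()        # E_G E_G K(Y, Y')
        K = kernel(x[:, None], x[None, :])
        K_cen = K - row[:, None] - row[None, :] + grand
        return K_cen.mean()

For example, kernel = lambda a, b: np.exp(-0.5 * (a - b) ** 2) with model_sampler = lambda m, rng: rng.normal(size=m) assesses the fit of a standard normal model.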
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:395-410 Template-Type: ReDIF-Article 1.0 Author-Name: Gerda Claeskens Author-X-Name-First: Gerda Author-X-Name-Last: Claeskens Author-Name: Mia Hubert Author-X-Name-First: Mia Author-X-Name-Last: Hubert Author-Name: Leen Slaets Author-X-Name-First: Leen Author-X-Name-Last: Slaets Author-Name: Kaveh Vakili Author-X-Name-First: Kaveh Author-X-Name-Last: Vakili Title: Multivariate Functional Halfspace Depth Abstract: This article defines and studies a depth for multivariate functional data. By its multivariate nature and by including a weight function, it acknowledges important characteristics of functional data, namely differences in the amount of local amplitude, shape, and phase variation. We study both population and finite sample versions. The multivariate sample of curves may include warping functions, derivatives, and integrals of the original curves for a better overall representation of the functional data via the depth. We present a simulation study and data examples that confirm the good performance of this depth function. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 411-423 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.856795 File-URL: http://hdl.handle.net/10.1080/01621459.2013.856795 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:411-423 Template-Type: ReDIF-Article 1.0 Author-Name: Victor M. Panaretos Author-X-Name-First: Victor M. Author-X-Name-Last: Panaretos Author-Name: Tung Pham Author-X-Name-First: Tung Author-X-Name-Last: Pham Author-Name: Zhigang Yao Author-X-Name-First: Zhigang Author-X-Name-Last: Yao Title: Principal Flows Abstract: We revisit the problem of extending the notion of principal component analysis (PCA) to multivariate datasets that satisfy nonlinear constraints, therefore lying on Riemannian manifolds. Our aim is to determine curves on the manifold that retain their canonical interpretability as principal components, while at the same time being flexible enough to capture nongeodesic forms of variation. We introduce the concept of a principal flow, a curve on the manifold passing through the mean of the data, and with the property that, at any point of the curve, the tangent velocity vector attempts to fit the first eigenvector of a tangent space PCA locally at that same point, subject to a smoothness constraint. That is, a particle flowing along the principal flow attempts to move along a path of maximal variation of the data, up to smoothness constraints. The rigorous definition of a principal flow is given by means of a Lagrangian variational problem, and its solution is reduced to an ODE problem via the Euler--Lagrange method. Conditions for existence and uniqueness are provided, and an algorithm is outlined for the numerical solution of the problem. Higher order principal flows are also defined. It is shown that global principal flows yield the usual principal components on a Euclidean space. By means of examples, it is illustrated that the principal flow is able to capture patterns of variation that can escape other manifold PCA methods.
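A Euclidean toy version can convey the local first-principal-direction intuition behind the principal flows abstract above: from a starting point, repeatedly step along the leading eigenvector of a kernel-weighted local covariance, keeping a consistent orientation. The bandwidth h, the step size, and the flat geometry are all simplifying assumptions; the article's construction lives on a Riemannian manifold and is defined through a Lagrangian variational problem.

    import numpy as np

    def local_pc_flow(data, start, h=1.0, step=0.05, n_steps=100):
        # Follow the leading eigenvector of a Gaussian-weighted local
        # covariance, sign-aligned with the previous step.
        path = [np.asarray(start, dtype=float)]
        direction = None
        for _ in range(n_steps):
            x = path[-1]
            w = np.exp(-0.5 * np.sum((data - x) ** 2, axis=1) / h ** 2)
            mu = (w[:, None] * data).sum(0) / w.sum()
            d = data - mu
            C = (w[:, None, None] * np.einsum('ni,nj->nij', d, d)).sum(0) / w.sum()
            v = np.linalg.eigh(C)[1][:, -1]     # leading local eigenvector
            if direction is not None and v @ direction < 0:
                v = -v                          # keep a consistent orientation
            direction = v
            path.append(x + step * v)
        return np.array(path)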
Journal: Journal of the American Statistical Association Pages: 424-436 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2013.849199 File-URL: http://hdl.handle.net/10.1080/01621459.2013.849199 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:424-436 Template-Type: ReDIF-Article 1.0 Author-Name: Suprateek Kundu Author-X-Name-First: Suprateek Author-X-Name-Last: Kundu Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: Bayes Variable Selection in Semiparametric Linear Models Abstract: There is a rich literature on Bayesian variable selection for parametric models. Our focus is on generalizing methods and asymptotic theory established for mixtures of g-priors to semiparametric linear regression models having unknown residual densities. Using a Dirichlet process location mixture for the residual density, we propose a semiparametric g-prior which incorporates an unknown matrix of cluster allocation indicators. For this class of priors, posterior computation can proceed via a straightforward stochastic search variable selection algorithm. In addition, Bayes factor and variable selection consistency is shown to result under a class of proper priors on g even when the number of candidate predictors p is allowed to increase much faster than sample size n, while making sparsity assumptions on the true model size. Journal: Journal of the American Statistical Association Pages: 437-447 Issue: 505 Volume: 109 Year: 2014 Month: 3 X-DOI: 10.1080/01621459.2014.881153 File-URL: http://hdl.handle.net/10.1080/01621459.2014.881153 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:437-447 Template-Type: ReDIF-Article 1.0 Author-Name: Yongling Xiao Author-X-Name-First: Yongling Author-X-Name-Last: Xiao Author-Name: Michal Abrahamowicz Author-X-Name-First: Michal Author-X-Name-Last: Abrahamowicz Author-Name: Erica E. M. Moodie Author-X-Name-First: Erica E. M. Author-X-Name-Last: Moodie Author-Name: Rainer Weber Author-X-Name-First: Rainer Author-X-Name-Last: Weber Author-Name: James Young Author-X-Name-First: James Author-X-Name-Last: Young Title: Flexible Marginal Structural Models for Estimating the Cumulative Effect of a Time-Dependent Treatment on the Hazard: Reassessing the Cardiovascular Risks of Didanosine Treatment in the Swiss HIV Cohort Study Abstract: The association between antiretroviral treatment and cardiovascular disease (CVD) risk in HIV-positive persons has been the subject of much debate since the Data collection on Adverse events of Anti-HIV Drugs (D:A:D) study reported that recent use of two antiretroviral drugs, abacavir (ABC) and didanosine (DDI), was associated with increased risk. We focus on the potential impact of DDI use, as this drug has not been studied as intensively as ABC. We propose a flexible marginal structural Cox model with weighted cumulative exposure modeling (Cox WCE MSM) to address two key challenges encountered when using observational longitudinal data to assess the adverse effects of medication: (1) the need to model the cumulative effect of a time-dependent treatment and (2) the need to control for time-dependent confounders that also act as mediators of the effect of past treatment.
Simulations confirm that the Cox WCE MSM yields accurate estimates of the causal treatment effect given complex exposure effects and time-dependent confounding. We then use the new flexible Cox WCE MSM to assess the association between DDI use and CVD risk in the Swiss HIV Cohort Study. In contrast to the nonsignificant results obtained with conventional parametric Cox MSMs, our new Cox WCE MSM identifies a significant short-term risk increase due to DDI use in the previous year. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 455-464 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.872650 File-URL: http://hdl.handle.net/10.1080/01621459.2013.872650 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:455-464 Template-Type: ReDIF-Article 1.0 Author-Name: Eduardo S. Ayra Author-X-Name-First: Eduardo S. Author-X-Name-Last: Ayra Author-Name: David Ríos Insua Author-X-Name-First: David Ríos Author-X-Name-Last: Insua Author-Name: Javier Cano Author-X-Name-First: Javier Author-X-Name-Last: Cano Title: To Fuel or Not to Fuel? Is that the Question? Abstract: According to the International Air Transport Association, the industry fuel bill accounts for more than 25% of the annual airline operating costs. In times of severe economic constraints and increasing fuel costs, air carriers are looking for ways to reduce costs and improve fuel efficiency without putting flight safety into jeopardy. In particular, this is inducing discussions on how much additional fuel to put in a planned route to avoid diverting to an alternate airport due to Air Traffic Flow Management delays. We provide here a general model to support such decisions. We illustrate it with a case study and provide a comparison with current practice, showing the relevance of our approach. Journal: Journal of the American Statistical Association Pages: 465-476 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.879060 File-URL: http://hdl.handle.net/10.1080/01621459.2013.879060 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:465-476 Template-Type: ReDIF-Article 1.0 Author-Name: Isadora Antoniano-Villalobos Author-X-Name-First: Isadora Author-X-Name-Last: Antoniano-Villalobos Author-Name: Sara Wade Author-X-Name-First: Sara Author-X-Name-Last: Wade Author-Name: Stephen G. Walker Author-X-Name-First: Stephen G. Author-X-Name-Last: Walker Title: A Bayesian Nonparametric Regression Model With Normalized Weights: A Study of Hippocampal Atrophy in Alzheimer's Disease Abstract: Hippocampal volume is one of the best established biomarkers for Alzheimer's disease. However, for appropriate use in clinical trials research, the evolution of hippocampal volume needs to be well understood. Recent theoretical models propose a sigmoidal pattern for its evolution. To support this theory, the use of Bayesian nonparametric regression mixture models seems particularly suitable due to the flexibility that models of this type can achieve and the unsatisfactory predictive properties of semiparametric methods. In this article, our aim is to develop an interpretable Bayesian nonparametric regression model which allows inference with combinations of both continuous and discrete covariates, as required for a full analysis of the dataset.
Simple arguments regarding the interpretation of Bayesian nonparametric regression mixtures lead naturally to regression weights based on normalized sums. Difficulty in working with the intractable normalizing constant is overcome thanks to recent advances in MCMC methods and the development of a novel auxiliary variable scheme. We apply the new model and MCMC method to study the dynamics of hippocampal volume, and our results provide statistical evidence in support of the theoretical hypothesis. Journal: Journal of the American Statistical Association Pages: 477-490 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.879061 File-URL: http://hdl.handle.net/10.1080/01621459.2013.879061 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:477-490 Template-Type: ReDIF-Article 1.0 Author-Name: Daniel M. Percival Author-X-Name-First: Daniel M. Author-X-Name-Last: Percival Author-Name: Donald B. Percival Author-X-Name-First: Donald B. Author-X-Name-Last: Percival Author-Name: Donald W. Denbo Author-X-Name-First: Donald W. Author-X-Name-Last: Denbo Author-Name: Edison Gica Author-X-Name-First: Edison Author-X-Name-Last: Gica Author-Name: Paul Y. Huang Author-X-Name-First: Paul Y. Author-X-Name-Last: Huang Author-Name: Harold O. Mofjeld Author-X-Name-First: Harold O. Author-X-Name-Last: Mofjeld Author-Name: Michael C. Spillane Author-X-Name-First: Michael C. Author-X-Name-Last: Spillane Title: Automated Tsunami Source Modeling Using the Sweeping Window Positive Elastic Net Abstract: In response to hazards posed by earthquake-induced tsunamis, the National Oceanic and Atmospheric Administration developed a system for issuing timely warnings to coastal communities. This system, in part, involves matching data collected in real time from deep-ocean buoys to a database of precomputed geophysical models, each associated with a geographical location. Currently, trained operators must handpick models from the database using the epicenter of the earthquake as guidance, which can delay the issuing of warnings. In this article, we introduce an automatic procedure to select models to improve the timing and accuracy of these warnings. This procedure uses an elastic-net-based penalized and constrained linear least-squares estimator in conjunction with a sweeping window. This window ensures that selected models are close spatially, which is desirable from geophysical considerations. We use the Akaike information criterion to settle on a particular window and to set the tuning parameters associated with the elastic net. Test data from the 2006 Kuril Islands and the devastating 2011 Japan tsunamis show that the automatic procedure yields model fits and verification equal to or better than those from a time-consuming hand-selected solution. Journal: Journal of the American Statistical Association Pages: 491-499 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.879062 File-URL: http://hdl.handle.net/10.1080/01621459.2013.879062 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:491-499 Template-Type: ReDIF-Article 1.0 Author-Name: Carl Schmertmann Author-X-Name-First: Carl Author-X-Name-Last: Schmertmann Author-Name: Emilio Zagheni Author-X-Name-First: Emilio Author-X-Name-Last: Zagheni Author-Name: Joshua R. Goldstein Author-X-Name-First: Joshua R.
Author-X-Name-Last: Goldstein Author-Name: Mikko Myrskylä Author-X-Name-First: Mikko Author-X-Name-Last: Myrskylä Title: Bayesian Forecasting of Cohort Fertility Abstract: There are signs that fertility in rich countries may have stopped declining, but this depends critically on whether women currently in reproductive ages are postponing or reducing lifetime fertility. Analysis of average completed family sizes requires forecasts of remaining fertility for women born 1970-1995. We propose a Bayesian model for fertility that incorporates a priori information about patterns over age and time. We use a new dataset, the Human Fertility Database (HFD), to construct improper priors that give high weight to historically plausible rate surfaces. In the age dimension, cohort schedules should be well approximated by principal components of HFD schedules. In the time dimension, series should be smooth and approximately linear over short spans. We calibrate priors so that approximation residuals have theoretical distributions similar to historical HFD data. Our priors use quadratic penalties and imply a high-dimensional normal posterior distribution for each country's fertility surface. Forecasts for HFD cohorts currently aged 15-44 show consistent patterns. In the United States, Northern Europe, and Western Europe, slight rebounds in completed fertility are likely. In Central and Southern Europe, East Asia, and Brazil, there is little evidence for a rebound. Our methods could be applied to other forecasting and missing-data problems with only minor modifications. Journal: Journal of the American Statistical Association Pages: 500-513 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2014.881738 File-URL: http://hdl.handle.net/10.1080/01621459.2014.881738 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:500-513 Template-Type: ReDIF-Article 1.0 Author-Name: Dandan Liu Author-X-Name-First: Dandan Author-X-Name-Last: Liu Author-Name: Yingye Zheng Author-X-Name-First: Yingye Author-X-Name-Last: Zheng Author-Name: Ross L. Prentice Author-X-Name-First: Ross L. Author-X-Name-Last: Prentice Author-Name: Li Hsu Author-X-Name-First: Li Author-X-Name-Last: Hsu Title: Estimating Risk With Time-to-Event Data: An Application to the Women's Health Initiative Abstract: Accurate and individualized risk prediction is critical for population control of chronic diseases such as cancer and cardiovascular disease. Large cohort studies provide valuable resources for building risk prediction models, as the risk factors are collected at the baseline and subjects are followed over time until disease occurrence or termination of the study. However, for rare diseases the baseline risk may not be estimated reliably based on cohort data only, due to sparse events. In this article, we propose to make use of external information to improve efficiency for estimating time-dependent absolute risk. We derive the relationship between external disease incidence rates and the baseline risk, and incorporate the external disease incidence information into estimation of absolute risks, while allowing for potential differences in disease incidence rates between the cohort and external sources. The asymptotic properties, namely, uniform consistency and weak convergence, of the proposed estimators are established.
Simulation results show that the proposed estimator for absolute risk is more efficient than that based on the Breslow estimator, which does not use external disease incidence rates. A large cohort study, the Women's Health Initiative Observational Study, is used to illustrate the proposed method. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 514-524 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2014.881739 File-URL: http://hdl.handle.net/10.1080/01621459.2014.881739 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:514-524 Template-Type: ReDIF-Article 1.0 Author-Name: Ick Hoon Jin Author-X-Name-First: Ick Hoon Author-X-Name-Last: Jin Author-Name: Suyu Liu Author-X-Name-First: Suyu Author-X-Name-Last: Liu Author-Name: Peter F. Thall Author-X-Name-First: Peter F. Author-X-Name-Last: Thall Author-Name: Ying Yuan Author-X-Name-First: Ying Author-X-Name-Last: Yuan Title: Using Data Augmentation to Facilitate Conduct of Phase I-II Clinical Trials With Delayed Outcomes Abstract: A practical impediment in adaptive clinical trials is that outcomes must be observed soon enough to apply decision rules to choose treatments for new patients. For example, if outcomes take up to six weeks to evaluate and the accrual rate is one patient per week, on average three new patients will be accrued while waiting to evaluate the outcomes of the previous three patients. The question is how to treat the new patients. This logistical problem persists throughout the trial. Various ad hoc practical solutions are used, none entirely satisfactory. We focus on this problem in phase I-II clinical trials that use binary toxicity and efficacy, defined in terms of event times, to choose doses adaptively for successive cohorts. We propose a general approach to this problem that treats late-onset outcomes as missing data, uses data augmentation to impute missing outcomes from posterior predictive distributions computed from partial follow-up times and complete outcome data, and applies the design's decision rules using the completed data. We illustrate the method with two cancer trials conducted using a phase I-II design based on efficacy-toxicity trade-offs, including a computer simulation study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 525-536 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2014.881740 File-URL: http://hdl.handle.net/10.1080/01621459.2014.881740 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:525-536 Template-Type: ReDIF-Article 1.0 Author-Name: Haim Y. Bar Author-X-Name-First: Haim Y. Author-X-Name-Last: Bar Author-Name: James G. Booth Author-X-Name-First: James G. Author-X-Name-Last: Booth Author-Name: Martin T. Wells Author-X-Name-First: Martin T. Author-X-Name-Last: Wells Title: A Bivariate Model for Simultaneous Testing in Bioinformatics Data Abstract: We develop a novel approach for testing treatment effects in high-throughput data. Most previous works on this topic focused on testing for differences between the means, but recently it has been recognized that testing for differential variation is probably as important.
We take it a step further, and introduce a bivariate modeling strategy that accounts for both differential expression and differential variation. Our model-based approach, in which the differential mean and variance are considered random effects, results in shrinkage estimation and powerful tests as it borrows strength across levels. We show in simulations that the method yields a substantial gain in the power to detect differential means when differential variation is present. Our case studies show that the model is realistic in a wide range of applications. Furthermore, a hierarchical estimation approach implemented using the EM algorithm results in a computationally efficient method which is particularly well-suited for "multiple testing" situations. Finally, we develop a power and sample size calculation tool that mirrors the estimation and inference method described in this article, and can be used to design experiments involving thousands of simultaneous tests. Journal: Journal of the American Statistical Association Pages: 537-547 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2014.884502 File-URL: http://hdl.handle.net/10.1080/01621459.2014.884502 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:537-547 Template-Type: ReDIF-Article 1.0 Author-Name: Xiaosun Lu Author-X-Name-First: Xiaosun Author-X-Name-Last: Lu Author-Name: J. S. Marron Author-X-Name-First: J. S. Author-X-Name-Last: Marron Author-Name: Perry Haaland Author-X-Name-First: Perry Author-X-Name-Last: Haaland Title: Object-Oriented Data Analysis of Cell Images Abstract: This article discusses a study of cell images in cell culture biology from an object-oriented point of view. The motivation of this research is to develop a statistical approach to cell image analysis that better supports the automated development of stem cell growth media. A major hurdle in this process is the need for human expertise, based on studying cells under the microscope, to make decisions about the next step of the cell culture process. We aim to use digital imaging technology coupled with statistical analysis to tackle this important problem. The discussion in this article highlights a common critical issue: choice of data objects. Instead of conventionally treating either the individual cells or the wells (a container in which the cells are grown) as data objects, a new type of data object is proposed: the union of a well with its corresponding set of cells. The image data analysis suggests that the cell-well unions can be a better choice of data objects than the cells or the wells alone. The data are available in the online supplementary materials. Journal: Journal of the American Statistical Association Pages: 548-559 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2014.884503 File-URL: http://hdl.handle.net/10.1080/01621459.2014.884503 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:548-559 Template-Type: ReDIF-Article 1.0 Author-Name: Kun Chen Author-X-Name-First: Kun Author-X-Name-Last: Chen Author-Name: Kung-Sik Chan Author-X-Name-First: Kung-Sik Author-X-Name-Last: Chan Author-Name: Nils Chr. Stenseth Author-X-Name-First: Nils Chr.
Author-X-Name-Last: Stenseth Title: Source-Sink Reconstruction Through Regularized Multicomponent Regression Analysis-With Application to Assessing Whether North Sea Cod Larvae Contributed to Local Fjord Cod in Skagerrak Abstract: The problem of reconstructing the source-sink dynamics arises in many biological systems. Our research is motivated by marine applications where newborns are passively dispersed by ocean currents from several potential spawning sources to settle in various nursery regions that collectively constitute the sink. The reconstruction of the sparse source-sink linkage pattern, that is, to identify which sources contribute to which regions in the sink, is a challenging task in marine ecology. We derive a constrained nonlinear multicomponent regression model for source-sink reconstruction, which is capable of simultaneously selecting important linkages from the sources to the sink regions and making inference about the unobserved spawning activities at the sources. A sparsity-inducing and nonnegativity-constrained regularization approach is developed for model estimation, and theoretically we show that our estimator enjoys the oracle properties. The empirical performance of the method is investigated via simulation studies mimicking real ecological applications. We examine the transport hypothesis that Atlantic cod larvae were transported by sea currents from the North Sea to a few exposed coastal fjords along the Norwegian Skagerrak. Our findings on the spawning date distribution are consistent with results from previous studies, and the proposed approach for the first time provides valid statistical support for the larval drift conjecture. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 560-573 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2014.898583 File-URL: http://hdl.handle.net/10.1080/01621459.2014.898583 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:560-573 Template-Type: ReDIF-Article 1.0 Author-Name: L. A. Stefanski Author-X-Name-First: L. A. Author-X-Name-Last: Stefanski Author-Name: Yichao Wu Author-X-Name-First: Yichao Author-X-Name-Last: Wu Author-Name: Kyle White Author-X-Name-First: Kyle Author-X-Name-Last: White Title: Variable Selection in Nonparametric Classification Via Measurement Error Model Selection Likelihoods Abstract: Using the relationships among ridge regression, LASSO estimation, and measurement error attenuation as motivation, a new measurement-error-model-based approach to variable selection is developed. After describing the approach in the familiar context of linear regression, we apply it to the problem of variable selection in nonparametric classification, resulting in a new kernel-based classifier with LASSO-like shrinkage and variable-selection properties. Finite-sample performance of the new classification method is studied via simulation and real data examples, and consistency of the method is studied theoretically. Supplementary materials for the article are available online. Journal: Journal of the American Statistical Association Pages: 574-589 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.858630 File-URL: http://hdl.handle.net/10.1080/01621459.2013.858630 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
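The Chen, Chan, and Stenseth abstract above couples two estimation ingredients, sparsity-inducing regularization and nonnegativity constraints. The sketch below shows those two ingredients in the simplest possible host, a linear model fit by coordinate descent with a one-sided soft threshold; the article's actual estimator applies them inside a constrained nonlinear multicomponent regression.

    import numpy as np

    def nonneg_lasso(X, y, lam, n_iter=500):
        # Coordinate descent for 0.5*||y - X b||^2 + lam * sum(b_j)
        # subject to b >= 0 (one-sided soft-thresholding update).
        n, p = X.shape
        beta = np.zeros(p)
        col_ss = (X ** 2).sum(axis=0)
        for _ in range(n_iter):
            for j in range(p):
                r = y - X @ beta + X[:, j] * beta[j]   # partial residual
                rho = X[:, j] @ r
                beta[j] = max(0.0, rho - lam) / col_ss[j]
        return beta

Zero coefficients correspond to source-to-sink linkages pruned from the model, which is the kind of sparse linkage pattern the source-sink reconstruction seeks.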
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:574-589 Template-Type: ReDIF-Article 1.0 Author-Name: Ngai Hang Chan Author-X-Name-First: Ngai Hang Author-X-Name-Last: Chan Author-Name: Chun Yip Yau Author-X-Name-First: Chun Yip Author-X-Name-Last: Yau Author-Name: Rong-Mao Zhang Author-X-Name-First: Rong-Mao Author-X-Name-Last: Zhang Title: Group LASSO for Structural Break Time Series Abstract: Consider a structural break autoregressive (SBAR) process in which the autoregressive parameters are constant within each of the m + 1 segments (j = 1, ..., m + 1) defined by the change-points {t_1, ..., t_m}, where 1 = t_0 < t_1 < ⋅⋅⋅ < t_{m+1} = n + 1, σ(·) is a measurable function, and {ϵ_t} are white noise with unit variance. In practice, the number of change-points m is usually assumed to be known and small, because a large m would involve a huge computational burden for parameter estimation. By reformulating the problem in a variable selection context, the group least absolute shrinkage and selection operator (LASSO) is proposed to estimate an SBAR model when m is unknown. It is shown that both m and the locations of the change-points {t_1, ..., t_m} can be consistently estimated from the data, and the computation can be efficiently performed. An improved practical version that incorporates group LASSO and the stepwise regression variable selection technique is discussed. Simulation studies are conducted to assess the finite sample performance. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 590-599 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.866566 File-URL: http://hdl.handle.net/10.1080/01621459.2013.866566 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:590-599 Template-Type: ReDIF-Article 1.0 Author-Name: Guangming Pan Author-X-Name-First: Guangming Author-X-Name-Last: Pan Author-Name: Jiti Gao Author-X-Name-First: Jiti Author-X-Name-Last: Gao Author-Name: Yanrong Yang Author-X-Name-First: Yanrong Author-X-Name-Last: Yang Title: Testing Independence Among a Large Number of High-Dimensional Random Vectors Abstract: Capturing dependence among a large number of high-dimensional random vectors is a very important and challenging problem. By arranging n random vectors of length p in the form of a matrix, we develop a linear spectral statistic of the constructed matrix to test whether the n random vectors are independent or not. Specifically, the proposed statistic can also be applied to n random vectors, each of whose elements can be written as either a linear stationary process or a linear combination of independent random variables. The asymptotic distribution of the proposed test statistic is established as n → ∞. To avoid estimating the spectrum of each random vector, a modified test statistic, which is based on splitting the original n vectors into two equal parts and eliminating the term that contains the inner structure of each random vector or time series, is constructed. The facts that the limiting distribution is normal and there is no need to know the inner structure of each investigated random vector result in simple implementation of the constructed test statistic. Simulation results demonstrate that the proposed test is powerful against several commonly used dependence structures.
An empirical application to detecting dependence among the closing prices of several stocks in the S&P 500 also illustrates the applicability and effectiveness of our provided test. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 600-612 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.872037 File-URL: http://hdl.handle.net/10.1080/01621459.2013.872037 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:600-612 Template-Type: ReDIF-Article 1.0 Author-Name: Giorgos Minas Author-X-Name-First: Giorgos Author-X-Name-Last: Minas Author-Name: John A.D. Aston Author-X-Name-First: John A.D. Author-X-Name-Last: Aston Author-Name: Nigel Stallard Author-X-Name-First: Nigel Author-X-Name-Last: Stallard Title: Adaptive Multivariate Global Testing Abstract: We present a methodology for dealing with recent challenges in testing global hypotheses using multivariate observations. The proposed tests target situations, often arising in emerging applications of neuroimaging, where the sample size n is relatively small compared with the observations' dimension K. We employ adaptive designs allowing for sequential modifications of the test statistics adapting to accumulated data. The adaptations are optimal in the sense of maximizing the predictive power of the test at each interim analysis while still controlling the Type I error. Optimality is obtained by a general result applicable to typical adaptive design settings. Further, we prove that the potentially high-dimensional design space of the tests can be reduced to a low-dimensional projection space enabling us to perform simpler power analysis studies, including comparisons to alternative tests. We illustrate the substantial improvement in efficiency that the proposed tests can make over standard tests, especially when n is smaller than, or slightly larger than, K. The methods are also studied empirically using both simulated data and data from an EEG study, where the use of prior knowledge substantially increases the power of the test. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 613-623 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.870905 File-URL: http://hdl.handle.net/10.1080/01621459.2013.870905 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:613-623 Template-Type: ReDIF-Article 1.0 Author-Name: Jing Lei Author-X-Name-First: Jing Author-X-Name-Last: Lei Title: Adaptive Global Testing for Functional Linear Models Abstract: This article studies global testing of the slope function in functional linear regression models. A major challenge in functional global testing is to choose the dimension of projection when approximating the functional regression model by a finite dimensional multivariate linear regression model. We develop a new method that simultaneously tests the slope vectors in a sequence of functional principal components regression models. The sequence of models being tested is determined by the sample size and is an integral part of the testing procedure. Our theoretical analysis shows that the proposed method is uniformly powerful over a class of smooth alternatives when the signal to noise ratio exceeds the detection boundary.
The methods and results reflect the deep connection between the functional linear regression model and the Gaussian sequence model. We also present an extensive simulation study and a real data example to illustrate the finite sample performance of our method. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 624-634 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.856794 File-URL: http://hdl.handle.net/10.1080/01621459.2013.856794 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:624-634 Template-Type: ReDIF-Article 1.0 Author-Name: Xu Liu Author-X-Name-First: Xu Author-X-Name-Last: Liu Author-Name: Hongmei Jiang Author-X-Name-First: Hongmei Author-X-Name-Last: Jiang Author-Name: Yong Zhou Author-X-Name-First: Yong Author-X-Name-Last: Zhou Title: Local Empirical Likelihood Inference for Varying-Coefficient Density-Ratio Models Based on Case-Control Data Abstract: In this article, we develop a varying-coefficient density-ratio model for case-control studies. The case and control samples come from two different distributions. Under the model assumption, the ratio of the two densities is related to the linear combination of covariates with varying coefficients through a known function. A special case is the exponential tilt model where the log ratio of the two densities is a linear function of covariates. We propose a local empirical likelihood (EL) approach to estimate the nonparametric coefficient functions. Under some regularity assumptions, the proposed estimators are shown to be consistent and asymptotically normally distributed. The sieve empirical likelihood ratio (SELR) test statistic for detecting whether the varying coefficients are really constant and other related hypotheses is constructed, and it approximately follows a chi-squared distribution. We introduce a modified bootstrap procedure to estimate the null distribution of the SELR when the sample size is small. We also examine the performance of the proposed method for finite sample sizes through simulation studies and illustrate it with a real dataset. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 635-646 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.858629 File-URL: http://hdl.handle.net/10.1080/01621459.2013.858629 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:635-646 Template-Type: ReDIF-Article 1.0 Author-Name: Bruno Scarpa Author-X-Name-First: Bruno Author-X-Name-Last: Scarpa Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: Enriched Stick-Breaking Processes for Functional Data Abstract: In many applications involving functional data, prior information is available about the proportion of curves having different attributes. It is not straightforward to include such information in existing procedures for functional data analysis. Generalizing the functional Dirichlet process (FDP), we propose a class of stick-breaking priors for distributions of functions. These priors incorporate functional atoms drawn from constrained stochastic processes. The stick-breaking weights are specified to allow user-specified prior probabilities for curve attributes, with hyperpriors accommodating uncertainty.
Compared with the FDP, the random distribution is enriched for curves having attributes known to be common. Theoretical properties are considered, methods are developed for posterior computation, and the approach is illustrated using data on temperature curves in menstrual cycles. Journal: Journal of the American Statistical Association Pages: 647-660 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.866564 File-URL: http://hdl.handle.net/10.1080/01621459.2013.866564 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:647-660 Template-Type: ReDIF-Article 1.0 Author-Name: Shuzhuan Zheng Author-X-Name-First: Shuzhuan Author-X-Name-Last: Zheng Author-Name: Lijian Yang Author-X-Name-First: Lijian Author-X-Name-Last: Yang Author-Name: Wolfgang K. Härdle Author-X-Name-First: Wolfgang K. Author-X-Name-Last: Härdle Title: A Smooth Simultaneous Confidence Corridor for the Mean of Sparse Functional Data Abstract: Functional data analysis (FDA) has become an important area of statistics research in the recent decade, yet a smooth simultaneous confidence corridor (SCC) does not exist in the literature for the mean function of sparse functional data. SCC is a powerful tool for making statistical inference on an entire unknown function; nonetheless, classic "Hungarian embedding" techniques for establishing asymptotic correctness of SCC completely fail for sparse functional data. We propose a local linear SCC and a shoal of confidence intervals (SCI) for the mean function of sparse functional data, and establish that it is asymptotically equivalent to the SCC of independent regression data, using new results from Gaussian process extreme value theory. The SCC procedure is examined in simulations for its superior theoretical accuracy and performance, and used to analyze growth curve data, confirming findings with quantified high significance levels. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 661-673 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.866899 File-URL: http://hdl.handle.net/10.1080/01621459.2013.866899 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:661-673 Template-Type: ReDIF-Article 1.0 Author-Name: Roger Koenker Author-X-Name-First: Roger Author-X-Name-Last: Koenker Author-Name: Ivan Mizera Author-X-Name-First: Ivan Author-X-Name-Last: Mizera Title: Convex Optimization, Shape Constraints, Compound Decisions, and Empirical Bayes Rules Abstract: Estimation of mixture densities for the classical Gaussian compound decision problem and their associated (empirical) Bayes rules is considered from two new perspectives. The first, motivated by Brown and Greenshtein, introduces a nonparametric maximum likelihood estimator of the mixture density subject to a monotonicity constraint on the resulting Bayes rule. The second, motivated by Jiang and Zhang, proposes a new approach to computing the Kiefer-Wolfowitz nonparametric maximum likelihood estimator for mixtures. In contrast to prior methods for these problems, our new approaches are cast as convex optimization problems that can be efficiently solved by modern interior point methods.
In particular, we show that the reformulation of the Kiefer-Wolfowitz estimator as a convex optimization problem reduces the computational effort by several orders of magnitude for typical problems, by comparison to prior EM-algorithm based methods, and thus greatly expands the practical applicability of the resulting methods. Our new procedures are compared with several existing empirical Bayes methods in simulations employing the well-established design of Johnstone and Silverman. Some further comparisons are made based on prediction of baseball batting averages. A Bernoulli mixture application is briefly considered in the penultimate section. Journal: Journal of the American Statistical Association Pages: 674-685 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.869224 File-URL: http://hdl.handle.net/10.1080/01621459.2013.869224 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:674-685 Template-Type: ReDIF-Article 1.0 Author-Name: Hua Zhou Author-X-Name-First: Hua Author-X-Name-Last: Zhou Author-Name: Yichao Wu Author-X-Name-First: Yichao Author-X-Name-Last: Wu Title: A Generic Path Algorithm for Regularized Statistical Estimation Abstract: Regularization is widely used in statistics and machine learning to prevent overfitting and to gear the solution toward prior information. In general, a regularized estimation problem minimizes the sum of a loss function and a penalty term. The penalty term is usually weighted by a tuning parameter and encourages certain constraints on the parameters to be estimated. Particular choices of constraints lead to the popular lasso, fused-lasso, and other generalized ℓ1 penalized regression methods. In this article we follow a recent idea by Wu and propose an exact path solver based on ordinary differential equations (EPSODE) that works for any convex loss function and can deal with generalized ℓ1 penalties as well as more complicated regularization such as inequality constraints encountered in shape-restricted regressions and nonparametric density estimation. Nonasymptotic error bounds for the equality regularized estimates are derived. In practice, the EPSODE can be coupled with AIC, BIC, Cp, or cross-validation to select an optimal tuning parameter, or provide a convenient model space for performing model averaging or aggregation. Our applications to generalized ℓ1 regularized generalized linear models, shape-restricted regressions, Gaussian graphical models, and nonparametric density estimation showcase the potential of the EPSODE algorithm. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 686-699 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.864166 File-URL: http://hdl.handle.net/10.1080/01621459.2013.864166 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
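The Kiefer-Wolfowitz computation discussed in the Koenker-Mizera abstract above becomes very compact once the mixing distribution is restricted to a fixed grid: maximize sum_i log( sum_j f_j φ(x_i - u_j) ) over the probability simplex. The sketch below implements the classical EM fixed-point iteration for this objective, i.e., the slow baseline that the article improves upon by handing the same convex program to an interior-point solver; the Gaussian location-mixture form and the grid are illustrative assumptions.

    import numpy as np
    from scipy.stats import norm

    def kw_npmle_em(x, grid, n_iter=200):
        # EM fixed-point updates for the grid-restricted NPMLE of a
        # Gaussian location mixture; f stays on the probability simplex.
        A = norm.pdf(x[:, None] - grid[None, :])    # likelihood matrix A_ij
        f = np.full(grid.size, 1.0 / grid.size)     # uniform starting weights
        for _ in range(n_iter):
            dens = A @ f                            # mixture density at each x_i
            f *= (A / dens[:, None]).mean(axis=0)   # f_j <- f_j * (1/n) sum_i A_ij / dens_i
        return f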
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:686-699 Template-Type: ReDIF-Article 1.0 Author-Name: Hulin Wu Author-X-Name-First: Hulin Author-X-Name-Last: Wu Author-Name: Tao Lu Author-X-Name-First: Tao Author-X-Name-Last: Lu Author-Name: Hongqi Xue Author-X-Name-First: Hongqi Author-X-Name-Last: Xue Author-Name: Hua Liang Author-X-Name-First: Hua Author-X-Name-Last: Liang Title: Sparse Additive Ordinary Differential Equations for Dynamic Gene Regulatory Network Modeling Abstract: The gene regulation network (GRN) is a high-dimensional complex system, which can be represented by various mathematical or statistical models. The ordinary differential equation (ODE) model is one of the popular dynamic GRN models. High-dimensional linear ODE models have been proposed to identify GRNs, but with the limitation of assuming linear regulation effects. In this article, we propose a sparse additive ODE (SA-ODE) model, coupled with ODE estimation methods and adaptive group least absolute shrinkage and selection operator (LASSO) techniques, to model dynamic GRNs that could flexibly deal with nonlinear regulation effects. The asymptotic properties of the proposed method are established and simulation studies are performed to validate the proposed approach. An application example for identifying the nonlinear dynamic GRN of T-cell activation is used to illustrate the usefulness of the proposed method. Journal: Journal of the American Statistical Association Pages: 700-716 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.859617 File-URL: http://hdl.handle.net/10.1080/01621459.2013.859617 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:700-716 Template-Type: ReDIF-Article 1.0 Author-Name: Aurore Delaigle Author-X-Name-First: Aurore Author-X-Name-Last: Delaigle Author-Name: Peter Hall Author-X-Name-First: Peter Author-X-Name-Last: Hall Title: Parametrically Assisted Nonparametric Estimation of a Density in the Deconvolution Problem Abstract: Nonparametric estimation of a density from contaminated data is a difficult problem, for which convergence rates are notoriously slow. We introduce parametrically assisted nonparametric estimators which can dramatically improve on the performance of standard nonparametric estimators when the assumed model is close to the true density, without greatly degrading the quality of purely nonparametric estimators in other cases. We establish optimal convergence rates for our problem and discuss estimators that attain these rates. The very good numerical properties of the methods are illustrated via a simulation study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 717-729 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.857611 File-URL: http://hdl.handle.net/10.1080/01621459.2013.857611 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:717-729 Template-Type: ReDIF-Article 1.0 Author-Name: Forrest W. Crawford Author-X-Name-First: Forrest W. Author-X-Name-Last: Crawford Author-Name: Vladimir N. Minin Author-X-Name-First: Vladimir N. Author-X-Name-Last: Minin Author-Name: Marc A. Suchard Author-X-Name-First: Marc A.
Author-X-Name-Last: Suchard Title: Estimation for General Birth-Death Processes Abstract: Birth-death processes (BDPs) are continuous-time Markov chains that track the number of "particles" in a system over time. While widely used in population biology, genetics, and ecology, statistical inference of the instantaneous particle birth and death rates remains largely limited to restrictive linear BDPs in which per-particle birth and death rates are constant. Researchers often observe the number of particles at discrete times, necessitating data augmentation procedures such as expectation-maximization (EM) to find maximum likelihood estimates (MLEs). For BDPs on finite state-spaces, there are powerful matrix methods for computing the conditional expectations needed for the E-step of the EM algorithm. For BDPs on infinite state-spaces, closed-form solutions for the E-step are available for some linear models, but most previous work has resorted to time-consuming simulation. Remarkably, we show that the E-step conditional expectations can be expressed as convolutions of computable transition probabilities for any general BDP with arbitrary rates. This important observation, along with a convenient continued fraction representation of the Laplace transforms of the transition probabilities, allows for novel and efficient computation of the conditional expectations for all BDPs, eliminating the need for truncation of the state-space or costly simulation. We use this insight to derive EM algorithms that yield maximum likelihood estimation for general BDPs characterized by various rate models, including generalized linear models (GLM). We show that our Laplace convolution technique outperforms competing methods when they are available and demonstrate a technique to accelerate EM algorithm convergence. We validate our approach using synthetic data and then apply our methods to cancer cell growth and estimation of mutation parameters in microsatellite evolution. Journal: Journal of the American Statistical Association Pages: 730-747 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.866565 File-URL: http://hdl.handle.net/10.1080/01621459.2013.866565 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:730-747 Template-Type: ReDIF-Article 1.0 Author-Name: Marco Di Marzio Author-X-Name-First: Marco Author-X-Name-Last: Di Marzio Author-Name: Agnese Panzera Author-X-Name-First: Agnese Author-X-Name-Last: Panzera Author-Name: Charles C. Taylor Author-X-Name-First: Charles C. Author-X-Name-Last: Taylor Title: Nonparametric Regression for Spherical Data Abstract: We develop nonparametric smoothing for regression when both the predictor and the response variables are defined on a sphere of whatever dimension. A local polynomial fitting approach is pursued, which retains all the advantages in terms of rate optimality, interpretability, and ease of implementation widely observed in the standard setting. Our estimates have a multi-output nature, meaning that each coordinate is estimated separately, within a scheme of regression with a linear response. The main properties include linearity and rotational equivariance. This research has been motivated by the fact that very few models describe this kind of regression; the existing methods are not widely employable, since they are parametric in nature and also require the same dimensionality for predictor and response spaces, along with a nonrandom design.
Our approach does not suffer from these limitations. Real-data case studies and simulation experiments are used to illustrate the effectiveness of the method. Journal: Journal of the American Statistical Association Pages: 748-763 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.866567 File-URL: http://hdl.handle.net/10.1080/01621459.2013.866567 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:748-763 Template-Type: ReDIF-Article 1.0 Author-Name: Miguel de Carvalho Author-X-Name-First: Miguel Author-X-Name-Last: de Carvalho Author-Name: Anthony C. Davison Author-X-Name-First: Anthony C. Author-X-Name-Last: Davison Title: Spectral Density Ratio Models for Multivariate Extremes Abstract: The modeling of multivariate extremes has received increasing attention in recent years because of its importance in risk assessment. In classical statistics of extremes, the joint distribution of two or more extremes has a nonparametric form, subject to moment constraints. This article develops a semiparametric model for the situation where several multivariate extremal distributions are linked through the action of a covariate on an unspecified baseline distribution, through a so-called density ratio model. Theoretical and numerical aspects of empirical likelihood inference for this model are discussed, and an application is given to pairs of extreme forest temperatures. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 764-776 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.872651 File-URL: http://hdl.handle.net/10.1080/01621459.2013.872651 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:764-776 Template-Type: ReDIF-Article 1.0 Author-Name: Chao Wang Author-X-Name-First: Chao Author-X-Name-Last: Wang Author-Name: Heng Liu Author-X-Name-First: Heng Author-X-Name-Last: Liu Author-Name: Jian-Feng Yao Author-X-Name-First: Jian-Feng Author-X-Name-Last: Yao Author-Name: Richard A. Davis Author-X-Name-First: Richard A. Author-X-Name-Last: Davis Author-Name: Wai Keung Li Author-X-Name-First: Wai Keung Author-X-Name-Last: Li Title: Self-Excited Threshold Poisson Autoregression Abstract: This article studies the theory and inference of an observation-driven model for time series of counts. It is assumed that the observations follow a Poisson distribution conditioned on an accompanying intensity process, which is equipped with a two-regime structure according to the magnitude of the lagged observations. Generalizing the Poisson autoregression, the model allows more flexible, and even negative, correlation in the observations, which cannot be produced by the single-regime model. Classical Markov chain theory and Lyapunov's method are used to derive the conditions under which the process has a unique invariant probability measure and to show a strong law of large numbers for the intensity process. Moreover, the asymptotic theory of the maximum likelihood estimates of the parameters is established. A simulation study and a real-data application are considered, where the model is applied to the number of major earthquakes in the world. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 777-787 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.872994 File-URL: http://hdl.handle.net/10.1080/01621459.2013.872994 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:777-787 Template-Type: ReDIF-Article 1.0 Author-Name: Zongwu Cai Author-X-Name-First: Zongwu Author-X-Name-Last: Cai Author-Name: Xian Wang Author-X-Name-First: Xian Author-X-Name-Last: Wang Title: Selection of Mixed Copula Model via Penalized Likelihood Abstract: A fundamental issue in applying a copula method is how to choose an appropriate copula function for a given problem. In this article we address this issue by proposing a new copula selection approach via penalized likelihood plus a shrinkage operator. The proposed method selects an appropriate copula function and estimates the related parameters simultaneously. We establish the asymptotic properties of the proposed penalized likelihood estimator, including the rate of convergence and asymptotic normality or nonnormality. In particular, when the true coefficient parameters may lie on the boundary of the parameter space and the dependence parameters lie in an unidentified subset of the parameter space, we show that the limiting distribution of a boundary parameter estimator is half-normal and that the penalized likelihood estimator of an unidentified parameter converges to an arbitrary value. Finally, Monte Carlo simulation studies are carried out to illustrate the finite sample performance of the proposed approach, and the proposed method is used to investigate the correlation structure and comovement of financial stock markets. Journal: Journal of the American Statistical Association Pages: 788-801 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.873366 File-URL: http://hdl.handle.net/10.1080/01621459.2013.873366 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:788-801 Template-Type: ReDIF-Article 1.0 Author-Name: Antonio Lijoi Author-X-Name-First: Antonio Author-X-Name-Last: Lijoi Author-Name: Bernardo Nipoti Author-X-Name-First: Bernardo Author-X-Name-Last: Nipoti Title: A Class of Hazard Rate Mixtures for Combining Survival Data From Different Experiments Abstract: Mixture models for hazard rate functions are widely used tools for addressing the statistical analysis of survival data subject to a censoring mechanism. The present article introduces a new class of vectors of random hazard rate functions that are expressed as kernel mixtures of dependent completely random measures. This leads to the definition of dependent nonparametric prior processes that are suitably tailored to draw inferences in the presence of heterogeneous observations. Besides its flexibility, an important appealing feature of our proposal is analytical tractability: we are, indeed, able to determine some relevant distributional properties and a posterior characterization that is also the key for devising an efficient Markov chain Monte Carlo sampler. For illustrative purposes, we specialize our general results to a class of dependent extended gamma processes.
We finally display a few numerical examples, including both simulated and real two-sample datasets: these allow us to identify the effect of a borrowing strength phenomenon and provide evidence of the effectiveness of the prior in dealing with datasets for which the proportional hazards assumption does not hold. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 802-814 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.869499 File-URL: http://hdl.handle.net/10.1080/01621459.2013.869499 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:802-814 Template-Type: ReDIF-Article 1.0 Author-Name: R. Dennis Cook Author-X-Name-First: R. Dennis Author-X-Name-Last: Cook Author-Name: Xin Zhang Author-X-Name-First: Xin Author-X-Name-Last: Zhang Title: Fused Estimators of the Central Subspace in Sufficient Dimension Reduction Abstract: When studying the regression of a univariate variable Y on a vector X of predictors, most existing sufficient dimension-reduction (SDR) methods require the construction of slices of Y to estimate moments of the conditional distribution of X given Y. But there is no widely accepted method for choosing the number of slices, and a poorly chosen slicing scheme may produce miserable results. We propose a novel and easily implemented fusing method that can mitigate the problem of choosing a slicing scheme and improve estimation efficiency at the same time. We develop two fused estimators, called FIRE and DIRE, based on an optimal inverse regression estimator. The asymptotic variance of FIRE is no larger than that of the original methods regardless of the choice of slicing scheme, while DIRE is less computationally intensive and more robust. Simulation studies show that the fused estimators perform effectively the same as or substantially better than the parent methods. Fused estimators based on other methods can be developed in parallel: fused sliced inverse regression (SIR), fused central solution space (CSS)-SIR, and fused likelihood-based method (LAD) estimators are introduced briefly. Simulation studies of the fused CSS-SIR and fused LAD estimators show substantial gains over their parent methods. A real data example is also presented for illustration and comparison. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 815-827 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.866563 File-URL: http://hdl.handle.net/10.1080/01621459.2013.866563 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:815-827 Template-Type: ReDIF-Article 1.0 Author-Name: Veronika Ročková Author-X-Name-First: Veronika Author-X-Name-Last: Ročková Author-Name: Edward I. George Author-X-Name-First: Edward I. Author-X-Name-Last: George Title: EMVS: The EM Approach to Bayesian Variable Selection Abstract: Despite rapid developments in stochastic search algorithms, the practicality of Bayesian variable selection methods has continued to pose challenges. High-dimensional data are now routinely analyzed, typically with many more covariates than observations.
To broaden the applicability of Bayesian variable selection for such high-dimensional linear regression contexts, we propose EMVS, a deterministic alternative to stochastic search based on an EM algorithm that exploits a conjugate mixture prior formulation to quickly find posterior modes. Combining a spike-and-slab regularization diagram for the discovery of active predictor sets with subsequent rigorous evaluation of posterior model probabilities, EMVS rapidly identifies promising sparse high posterior probability submodels. External structural information, such as likely covariate groupings or network topologies, is easily incorporated into the EMVS framework. Deterministic annealing variants are seen to improve the effectiveness of our algorithms by mitigating the posterior multimodality associated with variable selection priors. The usefulness of the EMVS approach is demonstrated on real high-dimensional data, where computational complexity renders stochastic search less practical. Journal: Journal of the American Statistical Association Pages: 828-846 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.869223 File-URL: http://hdl.handle.net/10.1080/01621459.2013.869223 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:828-846 Template-Type: ReDIF-Article 1.0 Author-Name: Faming Liang Author-X-Name-First: Faming Author-X-Name-Last: Liang Author-Name: Yichen Cheng Author-X-Name-First: Yichen Author-X-Name-Last: Cheng Author-Name: Guang Lin Author-X-Name-First: Guang Author-X-Name-Last: Lin Title: Simulated Stochastic Approximation Annealing for Global Optimization With a Square-Root Cooling Schedule Abstract: Simulated annealing has been widely used in the solution of optimization problems. As is well known, simulated annealing cannot be guaranteed to locate the global optima unless a logarithmic cooling schedule is used; however, the logarithmic cooling schedule is too slow to be practical. This article proposes a new stochastic optimization algorithm, the so-called simulated stochastic approximation annealing algorithm, which is a combination of simulated annealing and the stochastic approximation Monte Carlo algorithm. Under the framework of stochastic approximation, it is shown that the new algorithm can work with a cooling schedule in which the temperature can decrease much faster than in the logarithmic cooling schedule, for example, a square-root cooling schedule, while guaranteeing that the global optima are reached as the temperature tends to zero. The new algorithm has been tested on a few benchmark optimization problems, including feed-forward neural network training and protein-folding. The numerical results indicate that the new algorithm can significantly outperform simulated annealing and other competitors. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 847-863 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2013.872993 File-URL: http://hdl.handle.net/10.1080/01621459.2013.872993 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:847-863 Template-Type: ReDIF-Article 1.0 Author-Name: L. Chen Author-X-Name-First: L. Author-X-Name-Last: Chen Author-Name: W. W. Dou Author-X-Name-First: W. W. Author-X-Name-Last: Dou Author-Name: Z.
Qiao Author-X-Name-First: Z. Author-X-Name-Last: Qiao Title: Ensemble Subsampling for Imbalanced Multivariate Two-Sample Tests Journal: Journal of the American Statistical Association Pages: 871-871 Issue: 506 Volume: 109 Year: 2014 Month: 6 X-DOI: 10.1080/01621459.2014.899497 File-URL: http://hdl.handle.net/10.1080/01621459.2014.899497 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:871-871 Template-Type: ReDIF-Article 1.0 Author-Name: Kassandra Fronczyk Author-X-Name-First: Kassandra Author-X-Name-Last: Fronczyk Author-Name: Athanasios Kottas Author-X-Name-First: Athanasios Author-X-Name-Last: Kottas Title: A Bayesian Nonparametric Modeling Framework for Developmental Toxicity Studies Abstract: We develop a Bayesian nonparametric mixture modeling framework for replicated count responses in dose-response settings. We explore this methodology for modeling and risk assessment in developmental toxicity studies, where the primary objective is to determine the relationship between the level of exposure to a toxic chemical and the probability of a physiological or biochemical response, or death. Data from these experiments typically involve features that cannot be captured by standard parametric approaches. To provide flexibility in the functional form of both the response distribution and the probability of positive response, the proposed mixture model is built from a dependent Dirichlet process prior, with the dependence of the mixing distributions governed by the dose level. The methodology is tested with a simulation study, which also includes a comparison with semiparametric Bayesian approaches to highlight the practical utility of the dependent Dirichlet process nonparametric mixture model. Further illustration is provided through the analysis of data from two developmental toxicity studies. Journal: Journal of the American Statistical Association Pages: 873-888 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2013.830445 File-URL: http://hdl.handle.net/10.1080/01621459.2013.830445 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:873-888 Template-Type: ReDIF-Article 1.0 Author-Name: Peter Müller Author-X-Name-First: Peter Author-X-Name-Last: Müller Author-Name: Fernando Quintana Author-X-Name-First: Fernando Author-X-Name-Last: Quintana Title: Comment Journal: Journal of the American Statistical Association Pages: 889-889 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.955987 File-URL: http://hdl.handle.net/10.1080/01621459.2014.955987 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:889-889 Template-Type: ReDIF-Article 1.0 Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: Comment Journal: Journal of the American Statistical Association Pages: 890-891 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.955988 File-URL: http://hdl.handle.net/10.1080/01621459.2014.955988 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:890-891 Template-Type: ReDIF-Article 1.0 Author-Name: Kassandra Fronczyk Author-X-Name-First: Kassandra Author-X-Name-Last: Fronczyk Author-Name: Athanasios Kottas Author-X-Name-First: Athanasios Author-X-Name-Last: Kottas Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 891-893 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.932171 File-URL: http://hdl.handle.net/10.1080/01621459.2014.932171 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:891-893 Template-Type: ReDIF-Article 1.0 Author-Name: Matthew W. Wheeler Author-X-Name-First: Matthew W. Author-X-Name-Last: Wheeler Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Author-Name: Sudha P. Pandalai Author-X-Name-First: Sudha P. Author-X-Name-Last: Pandalai Author-Name: Brent A. Baker Author-X-Name-First: Brent A. Author-X-Name-Last: Baker Author-Name: Amy H. Herring Author-X-Name-First: Amy H. Author-X-Name-Last: Herring Title: Mechanistic Hierarchical Gaussian Processes Abstract: The statistics literature on functional data analysis focuses primarily on flexible black-box approaches, which are designed to allow individual curves to have essentially any shape while characterizing variability. Such methods typically cannot incorporate mechanistic information, which is commonly expressed in terms of differential equations. Motivated by studies of muscle activation, we propose a nonparametric Bayesian approach that takes into account mechanistic understanding of muscle physiology. A novel class of hierarchical Gaussian processes is defined that favors curves consistent with differential equations defined on motor, damper, and spring systems. A Gibbs sampler is proposed to sample from the posterior distribution and applied to a study of rats exposed to noninjurious muscle activation protocols. Although motivated by muscle force data, a parallel approach can be used to include mechanistic information in broad functional data analysis applications. Journal: Journal of the American Statistical Association Pages: 894-904 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.899234 File-URL: http://hdl.handle.net/10.1080/01621459.2014.899234 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:894-904 Template-Type: ReDIF-Article 1.0 Author-Name: Yuan Jiang Author-X-Name-First: Yuan Author-X-Name-Last: Jiang Author-Name: Ni Li Author-X-Name-First: Ni Author-X-Name-Last: Li Author-Name: Heping Zhang Author-X-Name-First: Heping Author-X-Name-Last: Zhang Title: Identifying Genetic Variants for Addiction via Propensity Score Adjusted Generalized Kendall's Tau Abstract: Identifying replicable genetic variants for addiction has been extremely challenging. Besides the common difficulties with genome-wide association studies (GWAS), environmental factors are known to be critical to addiction, and comorbidity is widely observed. Despite the importance of environmental factors and comorbidity in the study of addiction, few GWAS analyses have adequately considered them, owing to the limitations of existing statistical methods. Although parametric methods have been developed to adjust for covariates in association analysis, difficulties arise when the traits are multivariate because there is no ready-to-use model for them.
Recent nonparametric developments include U-statistics that measure the phenotype-genotype association weighted by a similarity score of covariates. However, it is not clear how to optimize the similarity score. Therefore, we propose a semiparametric method to measure the association adjusted for covariates. In our approach, the nonparametric U-statistic is adjusted by parametric estimates of propensity scores using the idea of inverse probability weighting. The new measurement is shown to be asymptotically unbiased under our null hypothesis, while the previous nonweighted and weighted ones are not. Simulation results show that our test improves power relative to the nonweighted and two other weighted U-statistic methods, and it is particularly powerful for detecting gene-environment interactions. Finally, we apply our proposed test to the Study of Addiction: Genetics and Environment (SAGE) to identify genetic variants for addiction. Novel genetic variants are found from our analysis, which warrant further investigation in the future. Journal: Journal of the American Statistical Association Pages: 905-930 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.901223 File-URL: http://hdl.handle.net/10.1080/01621459.2014.901223 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:905-930 Template-Type: ReDIF-Article 1.0 Author-Name: Peter F. Thall Author-X-Name-First: Peter F. Author-X-Name-Last: Thall Author-Name: Hoang Q. Nguyen Author-X-Name-First: Hoang Q. Author-X-Name-Last: Nguyen Author-Name: Sarah Zohar Author-X-Name-First: Sarah Author-X-Name-Last: Zohar Author-Name: Pierre Maton Author-X-Name-First: Pierre Author-X-Name-Last: Maton Title: Optimizing Sedative Dose in Preterm Infants Undergoing Treatment for Respiratory Distress Syndrome Abstract: The intubation-surfactant-extubation (INSURE) procedure is used worldwide to treat preterm newborn infants suffering from respiratory distress syndrome, which is caused by an insufficient amount of the chemical surfactant in the lungs. With INSURE, the infant is intubated, surfactant is administered via the tube to the trachea, and at completion the infant is extubated. This improves the infant's ability to breathe and thus decreases the risk of long-term neurological or motor disabilities. To perform the intubation safely, the newborn infant first must be sedated. Despite extensive experience with INSURE, there is no consensus on what sedative dose is best. This article describes a Bayesian sequentially adaptive design for a multi-institution clinical trial to optimize the sedative dose given to preterm infants undergoing the INSURE procedure. The design is based on three clinical outcomes, two efficacy and one adverse, using elicited numerical utilities of the eight possible elementary outcomes. A flexible Bayesian parametric trivariate dose-outcome model is assumed, with the prior derived from elicited mean outcome probabilities. Doses are chosen adaptively for successive cohorts of infants using posterior mean utilities, subject to safety and efficacy constraints. A computer simulation study of the design is presented. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 931-943 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.904789 File-URL: http://hdl.handle.net/10.1080/01621459.2014.904789 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:931-943 Template-Type: ReDIF-Article 1.0 Author-Name: Radu Herbei Author-X-Name-First: Radu Author-X-Name-Last: Herbei Author-Name: L. Mark Berliner Author-X-Name-First: L. Mark Author-X-Name-Last: Berliner Title: Estimating Ocean Circulation: An MCMC Approach With Approximated Likelihoods via the Bernoulli Factory Abstract: We provide a Bayesian analysis of ocean circulation based on data collected in the South Atlantic Ocean. The analysis incorporates a reaction-diffusion partial differential equation that is not solvable in closed form. This leads to an intractable likelihood function. We describe a novel Markov chain Monte Carlo approach that does not require a likelihood evaluation. Rather, we use unbiased estimates of the likelihood and a Bernoulli factory to decide whether or not proposed states are accepted. The variates required to estimate the likelihood function are obtained via a Feynman-Kac representation. This lifts the common restriction of selecting a regular grid for the physical model and eliminates the need for data preprocessing. We implement our approach in a parallel graphics processing unit (GPU) computing environment. Journal: Journal of the American Statistical Association Pages: 944-954 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.914439 File-URL: http://hdl.handle.net/10.1080/01621459.2014.914439 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:944-954 Template-Type: ReDIF-Article 1.0 Author-Name: Celine Marielle Laffont Author-X-Name-First: Celine Marielle Author-X-Name-Last: Laffont Author-Name: Marc Vandemeulebroecke Author-X-Name-First: Marc Author-X-Name-Last: Vandemeulebroecke Author-Name: Didier Concordet Author-X-Name-First: Didier Author-X-Name-Last: Concordet Title: Multivariate Analysis of Longitudinal Ordinal Data With Mixed Effects Models, With Application to Clinical Outcomes in Osteoarthritis Abstract: Our objective was to evaluate the efficacy of robenacoxib in osteoarthritic dogs using four ordinal responses measured repeatedly over time. We propose a multivariate probit mixed effects model to describe the joint evolution of endpoints and to reveal the intrinsic correlations between responses that are not due to the treatment effect. Maximum likelihood computation is intractable within reasonable time frames. We therefore use a pairwise modeling approach in combination with a stochastic EM algorithm. Multidimensional ordinal responses with longitudinal measurements are a common feature in clinical trials. However, the standard methods for data analysis use unidimensional models, resulting in a loss of information. Our methodology provides substantially greater insight than these methods for the evaluation of treatment effects and shows good performance at low computational cost. We thus believe that it could be used in routine practice to optimize the evaluation of treatment efficacy.
Journal: Journal of the American Statistical Association Pages: 955-966 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.917977 File-URL: http://hdl.handle.net/10.1080/01621459.2014.917977 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:955-966 Template-Type: ReDIF-Article 1.0 Author-Name: Michael E. Sobel Author-X-Name-First: Michael E. Author-X-Name-Last: Sobel Author-Name: Martin A. Lindquist Author-X-Name-First: Martin A. Author-X-Name-Last: Lindquist Title: Causal Inference for fMRI Time Series Data With Systematic Errors of Measurement in a Balanced On/Off Study of Social Evaluative Threat Abstract: Functional magnetic resonance imaging (fMRI) has facilitated major advances in understanding human brain function. Neuroscientists are interested in using fMRI to study the effects of external stimuli on brain activity and causal relationships among brain regions, but have not stated what is meant by causation or defined the effects they purport to estimate. Building on Rubin's causal model, we construct a framework for causal inference using blood oxygenation level dependent (BOLD) fMRI time series data. In the usual statistical literature on causal inference, potential outcomes, assumed to be measured without systematic error, are used to define unit and average causal effects. However, in general the potential BOLD responses are measured with stimulus-dependent systematic error. Thus we define unit and average causal effects that are free of systematic error. In contrast to the usual case of a randomized experiment, where adjustment for intermediate outcomes leads to biased estimates of treatment effects, here the failure to adjust for task-dependent systematic error leads to biased estimates. We therefore adjust for systematic error using measured "noise covariates," using a linear mixed model to estimate the effects and the systematic error. Our results are important for neuroscientists, who typically do not adjust for systematic error. They should also prove useful to researchers in other areas where responses are measured with error and in fields where large amounts of data are collected on relatively few subjects. To illustrate our approach, we reanalyze data from a social evaluative threat task, comparing the findings with results that ignore systematic error. Journal: Journal of the American Statistical Association Pages: 967-976 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.922886 File-URL: http://hdl.handle.net/10.1080/01621459.2014.922886 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:967-976 Template-Type: ReDIF-Article 1.0 Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Author-Name: Zakaria Khondker Author-X-Name-First: Zakaria Author-X-Name-Last: Khondker Author-Name: Zhaohua Lu Author-X-Name-First: Zhaohua Author-X-Name-Last: Lu Author-Name: Joseph G. Ibrahim Author-X-Name-First: Joseph G. Author-X-Name-Last: Ibrahim Title: Bayesian Generalized Low Rank Regression Models for Neuroimaging Phenotypes and Genetic Markers Abstract: We propose a Bayesian generalized low-rank regression model (GLRR) for the analysis of both high-dimensional responses and covariates. This development is motivated by the search for associations between genetic variants and brain imaging phenotypes.
GLRR uses a low-rank matrix to approximate the high-dimensional regression coefficient matrix and a dynamic factor model to capture the high-dimensional covariance matrix of the brain imaging phenotypes. Local hypothesis testing is developed to identify significant covariates for the high-dimensional responses. Posterior computation proceeds via an efficient Markov chain Monte Carlo algorithm. A simulation study is performed to evaluate the finite sample performance of GLRR and to compare it with several competing approaches. We apply GLRR to investigate the impact of 1071 SNPs in the top 40 genes reported by the AlzGene database on the volumes of 93 regions of interest (ROIs) obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 977-990 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.923775 File-URL: http://hdl.handle.net/10.1080/01621459.2014.923775 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:977-990 Template-Type: ReDIF-Article 1.0 Author-Name: Bradley Efron Author-X-Name-First: Bradley Author-X-Name-Last: Efron Title: Estimation and Accuracy After Model Selection Abstract: Classical statistical theory ignores model selection in assessing estimation accuracy. Here we consider bootstrap methods for computing standard errors and confidence intervals that take model selection into account. The methodology involves bagging, also known as bootstrap smoothing, to tame the erratic discontinuities of selection-based estimators. A useful new formula for the accuracy of bagging then provides standard errors for the smoothed estimators. Two examples, nonparametric and parametric, are carried through in detail: a regression model where the choice of degree (linear, quadratic, cubic, ...) is determined by the Cp criterion and a Lasso-based estimation problem. Journal: Journal of the American Statistical Association Pages: 991-1007 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2013.823775 File-URL: http://hdl.handle.net/10.1080/01621459.2013.823775 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:991-1007 Template-Type: ReDIF-Article 1.0 Author-Name: Lan Wang Author-X-Name-First: Lan Author-X-Name-Last: Wang Author-Name: Ben Sherwood Author-X-Name-First: Ben Author-X-Name-Last: Sherwood Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Title: Comment Journal: Journal of the American Statistical Association Pages: 1007-1010 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.905399 File-URL: http://hdl.handle.net/10.1080/01621459.2014.905399 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1007-1010 Template-Type: ReDIF-Article 1.0 Author-Name: Dimitris N. Politis Author-X-Name-First: Dimitris N. Author-X-Name-Last: Politis Title: Comment Journal: Journal of the American Statistical Association Pages: 1010-1013 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.905788 File-URL: http://hdl.handle.net/10.1080/01621459.2014.905788 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1010-1013 Template-Type: ReDIF-Article 1.0 Author-Name: Shuva Gupta Author-X-Name-First: Shuva Author-X-Name-Last: Gupta Author-Name: S. N. Lahiri Author-X-Name-First: S. N. Author-X-Name-Last: Lahiri Title: Comment Journal: Journal of the American Statistical Association Pages: 1013-1015 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.905789 File-URL: http://hdl.handle.net/10.1080/01621459.2014.905789 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1013-1015 Template-Type: ReDIF-Article 1.0 Author-Name: Andrew Gelman Author-X-Name-First: Andrew Author-X-Name-Last: Gelman Author-Name: Aki Vehtari Author-X-Name-First: Aki Author-X-Name-Last: Vehtari Title: Comment Journal: Journal of the American Statistical Association Pages: 1015-1016 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.906153 File-URL: http://hdl.handle.net/10.1080/01621459.2014.906153 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1015-1016 Template-Type: ReDIF-Article 1.0 Author-Name: Nils Lid Hjort Author-X-Name-First: Nils Lid Author-X-Name-Last: Hjort Title: Comment Journal: Journal of the American Statistical Association Pages: 1017-1020 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.923315 File-URL: http://hdl.handle.net/10.1080/01621459.2014.923315 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1017-1020 Template-Type: ReDIF-Article 1.0 Author-Name: Bradley Efron Author-X-Name-First: Bradley Author-X-Name-Last: Efron Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 1021-1022 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.932172 File-URL: http://hdl.handle.net/10.1080/01621459.2014.932172 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1021-1022 Template-Type: ReDIF-Article 1.0 Author-Name: Ke Deng Author-X-Name-First: Ke Author-X-Name-Last: Deng Author-Name: Simeng Han Author-X-Name-First: Simeng Author-X-Name-Last: Han Author-Name: Kate J. Li Author-X-Name-First: Kate J. Author-X-Name-Last: Li Author-Name: Jun S. Liu Author-X-Name-First: Jun S. Author-X-Name-Last: Liu Title: Bayesian Aggregation of Order-Based Rank Data Abstract: Rank aggregation, that is, combining several ranking functions (called base rankers) to get aggregated, usually stronger rankings of a given set of items, is encountered in many disciplines. Most methods in the literature assume that base rankers of interest are equally reliable. It is very common in practice, however, that some rankers are more informative and reliable than others. It is desirable to distinguish high quality base rankers from low quality ones and treat them differently. Some methods achieve this by assigning prespecified weights to base rankers. But there are no systematic and principled strategies for designing a proper weighting scheme for a practical problem. In this article, we propose a Bayesian approach, called Bayesian aggregation of rank data (BARD), to overcome this limitation. 
By attaching a quality parameter to each base ranker and estimating these parameters along with the aggregation process, BARD measures the reliabilities of base rankers in a quantitative way and makes use of this information to improve the aggregated ranking. In addition, we design a method to detect highly correlated rankers and to account for their information redundancy appropriately. Both simulation studies and real data applications show that BARD significantly outperforms existing methods when the quality of base rankers varies greatly. Journal: Journal of the American Statistical Association Pages: 1023-1039 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2013.878660 File-URL: http://hdl.handle.net/10.1080/01621459.2013.878660 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1023-1039 Template-Type: ReDIF-Article 1.0 Author-Name: Andrew J. Womack Author-X-Name-First: Andrew J. Author-X-Name-Last: Womack Author-Name: Luis León-Novelo Author-X-Name-First: Luis Author-X-Name-Last: León-Novelo Author-Name: George Casella Author-X-Name-First: George Author-X-Name-Last: Casella Title: Inference From Intrinsic Bayes' Procedures Under Model Selection and Uncertainty Abstract: In this article, we present a fully coherent and consistent objective Bayesian analysis of the linear regression model using intrinsic priors. The intrinsic prior is a scaled mixture of g-priors and promotes shrinkage toward the subspace defined by a base (or null) model. While it has been established that the intrinsic prior provides consistent model selectors across a range of models, the posterior distribution of the model parameters has not previously been investigated. We prove that the posterior distribution of the model parameters is consistent under both model selection and model averaging when the number of regressors is fixed. Further, we derive tractable expressions for the intrinsic posterior distribution as well as sampling algorithms for both a selected model and model averaging. We compare the intrinsic prior to other mixtures of g-priors and provide details on the consistency properties of modified versions of the Zellner-Siow prior and hyper g-priors. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1040-1053 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.880348 File-URL: http://hdl.handle.net/10.1080/01621459.2014.880348 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1040-1053 Template-Type: ReDIF-Article 1.0 Author-Name: T. Tony Cai Author-X-Name-First: T. Tony Author-X-Name-Last: Cai Author-Name: Mark Low Author-X-Name-First: Mark Author-X-Name-Last: Low Author-Name: Zongming Ma Author-X-Name-First: Zongming Author-X-Name-Last: Ma Title: Adaptive Confidence Bands for Nonparametric Regression Functions Abstract: This article proposes a new formulation for the construction of adaptive confidence bands (CBs) in nonparametric function estimation problems. The CBs have a size that adapts to the smoothness of the function, while guaranteeing that both the relative excess mass of the function lying outside the band and the measure of the set of points where the function lies outside the band are small. It is shown that the bands adapt over a maximum range of Lipschitz classes.
The adaptive CB can be easily implemented in standard statistical software with wavelet support. We investigate the numerical performance of the procedure using both simulated and real datasets. The numerical results agree well with the theoretical analysis. The procedure can be easily modified and used for other nonparametric function estimation models. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1054-1070 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2013.879260 File-URL: http://hdl.handle.net/10.1080/01621459.2013.879260 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1054-1070 Template-Type: ReDIF-Article 1.0 Author-Name: Marc Hallin Author-X-Name-First: Marc Author-X-Name-Last: Hallin Author-Name: Davy Paindaveine Author-X-Name-First: Davy Author-X-Name-Last: Paindaveine Author-Name: Thomas Verdebout Author-X-Name-First: Thomas Author-X-Name-Last: Verdebout Title: Efficient R-Estimation of Principal and Common Principal Components Abstract: We propose rank-based estimators of principal components, both in the one-sample and, under the assumption of common principal components, in the m-sample cases. Those estimators are obtained via a rank-based version of Le Cam's one-step method, combined with an estimation of cross-information quantities. Under arbitrary elliptical distributions with, in the m-sample case, possibly heterogeneous radial densities, those R-estimators remain root-n consistent and asymptotically normal, while achieving asymptotic efficiency under correctly specified radial densities. Contrary to their traditional counterparts computed from empirical covariances, they do not require any moment conditions. When based on Gaussian score functions, in the one-sample case, they uniformly dominate their classical competitors in the Pitman sense. Their AREs with respect to other robust procedures are quite high: up to 30, in the Gaussian case, relative to minimum covariance determinant estimators. Their finite-sample performances are investigated via a Monte Carlo study. Journal: Journal of the American Statistical Association Pages: 1071-1083 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.880057 File-URL: http://hdl.handle.net/10.1080/01621459.2014.880057 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1071-1083 Template-Type: ReDIF-Article 1.0 Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Linglong Kong Author-X-Name-First: Linglong Author-X-Name-Last: Kong Title: Spatially Varying Coefficient Model for Neuroimaging Data With Jump Discontinuities Abstract: Motivated by recent work on studying massive imaging data in various neuroimaging studies, we propose a novel spatially varying coefficient model (SVCM) to capture the varying association of imaging measures in a three-dimensional volume (or two-dimensional surface) with a set of covariates. Two stylized features of neuroimaging data are the presence of multiple piecewise-smooth regions with unknown edges and jumps, and substantial spatial correlations.
To specifically account for these two features, SVCM includes a measurement model with multiple varying coefficient functions, a jumping surface model for each varying coefficient function, and a functional principal component model. We develop a three-stage estimation procedure to simultaneously estimate the varying coefficient functions and the spatial correlations. The estimation procedure includes a fast multiscale adaptive estimation and testing procedure to independently estimate each varying coefficient function, while preserving its edges among different piecewise-smooth regions. We systematically investigate the asymptotic properties (e.g., consistency and asymptotic normality) of the multiscale adaptive parameter estimates. We also establish the uniform convergence rate of the estimated spatial covariance function and its associated eigenvalues and eigenfunctions. Our Monte Carlo simulation and real-data analysis have confirmed the excellent performance of SVCM. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1084-1098 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.881742 File-URL: http://hdl.handle.net/10.1080/01621459.2014.881742 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1084-1098 Template-Type: ReDIF-Article 1.0 Author-Name: Valentin Patilea Author-X-Name-First: Valentin Author-X-Name-Last: Patilea Author-Name: Hamdi Raïssi Author-X-Name-First: Hamdi Author-X-Name-Last: Raïssi Title: Testing Second-Order Dynamics for Autoregressive Processes in Presence of Time-Varying Variance Abstract: This article considers the volatility modeling for autoregressive univariate time series. A benchmark approach is the stationary autoregressive conditional heteroscedasticity (ARCH) model of Engle. Motivated by real-data evidence, processes with nonconstant unconditional variance and ARCH effects have recently been introduced. We take into account this type of nonstationarity in variance and propose simple testing procedures for ARCH effects. Adaptive versions of McLeod and Li's portmanteau test and the ARCH-LM test are provided for checking the presence of such second-order dynamics. The standard versions of these tests, commonly used by practitioners, suppose a constant unconditional variance. The failure of these standard tests in the presence of time-varying unconditional variance is highlighted. The theoretical results are illustrated by means of simulated and real data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1099-1111 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.884504 File-URL: http://hdl.handle.net/10.1080/01621459.2014.884504 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1099-1111 Template-Type: ReDIF-Article 1.0 Author-Name: Andrew Harvey Author-X-Name-First: Andrew Author-X-Name-Last: Harvey Author-Name: Alessandra Luati Author-X-Name-First: Alessandra Author-X-Name-Last: Luati Title: Filtering With Heavy Tails Abstract: An unobserved components model in which the signal is buried in noise that is non-Gaussian may throw up observations that, when judged by the Gaussian yardstick, are outliers.
We describe an observation-driven model, based on a conditional Student's t-distribution, which is tractable and retains some of the desirable features of the linear Gaussian model. Letting the dynamics be driven by the score of the conditional distribution leads to a specification that is not only easy to implement but also facilitates the development of a comprehensive and relatively straightforward theory for the asymptotic distribution of the maximum likelihood estimator. The methods are illustrated with an application to rail travel in the United Kingdom. The final part of the article shows how the model may be extended to include explanatory variables. Journal: Journal of the American Statistical Association Pages: 1112-1122 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.887011 File-URL: http://hdl.handle.net/10.1080/01621459.2014.887011 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1112-1122 Template-Type: ReDIF-Article 1.0 Author-Name: Bo Wang Author-X-Name-First: Bo Author-X-Name-Last: Wang Author-Name: Jian Qing Shi Author-X-Name-First: Jian Qing Author-X-Name-Last: Shi Title: Generalized Gaussian Process Regression Model for Non-Gaussian Functional Data Abstract: In this article, we propose a generalized Gaussian process concurrent regression model for functional data, where the functional response variable has a binomial, Poisson, or other non-Gaussian distribution from an exponential family, while the covariates are mixed functional and scalar variables. The proposed model offers a nonparametric generalized concurrent regression method for functional data with multidimensional covariates, and provides a natural framework for simultaneously modeling the common mean structure and covariance structure of repeatedly observed functional data. The mean structure provides overall information about the observations, while the covariance structure can be used to capture the characteristics of each individual batch. The prior specification of the covariance kernel enables us to accommodate a wide class of nonlinear models. The definition of the model, the inference, and the implementation, as well as its asymptotic properties, are discussed. Several numerical examples with different non-Gaussian response variables are presented. Some technical details and more numerical examples as well as an extension of the model are provided as supplementary materials. Journal: Journal of the American Statistical Association Pages: 1123-1133 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.889021 File-URL: http://hdl.handle.net/10.1080/01621459.2014.889021 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1123-1133 Template-Type: ReDIF-Article 1.0 Author-Name: Yong-Dao Zhou Author-X-Name-First: Yong-Dao Author-X-Name-Last: Zhou Author-Name: Hongquan Xu Author-X-Name-First: Hongquan Author-X-Name-Last: Xu Title: Space-Filling Fractional Factorial Designs Abstract: Fractional factorial designs are widely used in various scientific investigations and industrial applications. Level permutation of factors could alter their geometrical structures and statistical properties. This article studies space-filling properties of fractional factorial designs under two commonly used space-filling measures, discrepancy and maximin distance.
When all possible level permutations are considered, the average discrepancy is expressed as a linear combination of the generalized word length pattern for fractional factorial designs with any number of levels and any discrepancy defined by a reproducing kernel. Generalized minimum aberration designs are shown to have good space-filling properties on average in terms of both discrepancy and distance. Several novel relationships between distance distribution and generalized word length pattern are derived. It is also shown that level permutations can improve space-filling properties for many existing saturated designs. A two-step construction procedure is proposed and three-, four-, and five-level space-filling fractional factorial designs are obtained. These new designs have better space-filling properties, such as larger distance and lower discrepancy, than existing ones, and are recommended for use in practice. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1134-1144 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2013.873367 File-URL: http://hdl.handle.net/10.1080/01621459.2013.873367 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1134-1144 Template-Type: ReDIF-Article 1.0 Author-Name: Paul R. Rosenbaum Author-X-Name-First: Paul R. Author-X-Name-Last: Rosenbaum Title: Weighted M-statistics With Superior Design Sensitivity in Matched Observational Studies With Multiple Controls Abstract: In a nonrandomized or observational study, a weak association between receipt of the treatment and an outcome may be explained not as effects caused by the treatment but rather by a small bias in the assignment of individuals to treatment or control; however, a strong association may be explained as noncausal only by a large bias. The strength of the association between treatment and outcome is not uniform across the data from a study, and this motivates giving greater weight where the association is stronger. In an observational study with treated-control matched pairs, it is known that results are less sensitive to unmeasured biases if pairs with small absolute differences in outcomes are given little weight in the analysis; more precisely, such a test statistic has superior design sensitivity. How should outcomes be weighted if an observational study is matched in sets with one treated subject and several controls? An M-statistic is the quantity equated to zero in defining Huber's M-estimates, including the mean, and it is used in testing hypotheses and setting confidence limits. In matched sets, a weighted M-statistic increases the weight of some matched sets and decreases the weight of others. Not unlike the case of matched pairs, weighted M-statistics with suitable weights have larger design sensitivities, and hence greater power in a sensitivity analysis, than unweighted statistics for symmetric unimodal errors, such as Normal, logistic, or t-distributed errors. This issue is examined using an asymptotic measure, the design sensitivity, and using simulation. For one Normal sampling situation, weighting the matched sets increased the power of a 0.05 level sensitivity analysis from 0.05 without weights to 0.75 with weights. An example from NHANES 2009-2010 concerning methylmercury in the blood of people who consume large amounts of fish is used as an illustration.
Journal: Journal of the American Statistical Association Pages: 1145-1158 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2013.879261 File-URL: http://hdl.handle.net/10.1080/01621459.2013.879261 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1145-1158 Template-Type: ReDIF-Article 1.0 Author-Name: Peisong Han Author-X-Name-First: Peisong Author-X-Name-Last: Han Title: Multiply Robust Estimation in Regression Analysis With Missing Data Abstract: Doubly robust estimators are widely used in missing-data analysis. They provide double protection of estimation consistency against model misspecification. However, they allow only a single model for the missingness mechanism and a single model for the data distribution, and the assumption that one of these two models is correctly specified is restrictive in practice. For regression analysis with a possibly missing outcome, we propose an estimation method that allows multiple models for both the missingness mechanism and the data distribution. The resulting estimator is consistent if any one of those multiple models is correctly specified, and thus provides multiple protection of consistency. This estimator is also robust against extreme values of the fitted missingness probability, which, for most doubly robust estimators, can lead to erroneously large inverse probability weights that may jeopardize the numerical performance. The numerical implementation of the proposed method through a modified Newton-Raphson algorithm is discussed. The asymptotic distribution of the resulting estimator is derived, based on which we study the estimation efficiency and provide ways to improve the efficiency. As an application, we analyze the data collected from the AIDS Clinical Trials Group Protocol 175. Journal: Journal of the American Statistical Association Pages: 1159-1173 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.880058 File-URL: http://hdl.handle.net/10.1080/01621459.2014.880058 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1159-1173 Template-Type: ReDIF-Article 1.0 Author-Name: Tianle Chen Author-X-Name-First: Tianle Author-X-Name-Last: Chen Author-Name: Yuanjia Wang Author-X-Name-First: Yuanjia Author-X-Name-Last: Wang Author-Name: Huaihou Chen Author-X-Name-First: Huaihou Author-X-Name-Last: Chen Author-Name: Karen Marder Author-X-Name-First: Karen Author-X-Name-Last: Marder Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Title: Targeted Local Support Vector Machine for Age-Dependent Classification Abstract: We develop methods to accurately predict whether presymptomatic individuals are at risk of a disease based on their various marker profiles, which offers an opportunity for early intervention well before definitive clinical diagnosis. For many diseases, existing clinical literature may suggest that the risk of disease varies with some markers of biological and etiological importance, for example, age. To identify effective prediction rules using nonparametric decision functions, standard statistical learning approaches treat markers with clear biological importance (e.g., age) and other markers without prior knowledge on disease etiology interchangeably as input variables.
Therefore, these approaches may be inadequate in singling out and preserving the effects from the biologically important variables, especially in the presence of potential noise markers. Using age as an example of a salient marker to receive special care in the analysis, we propose a local smoothing large margin classifier implemented with support vector machine (SVM) to construct effective age-dependent classification rules. The method adaptively adjusts the age effect and separately tunes age and other markers to achieve optimal performance. We derive the asymptotic risk bound of the local smoothing SVM and perform extensive simulation studies to compare with standard approaches. We apply the proposed method to two studies of premanifest Huntington's disease (HD) subjects and controls to construct age-sensitive predictive scores for the risk of HD and risk of receiving HD diagnosis during the study period. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1174-1187 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.881743 File-URL: http://hdl.handle.net/10.1080/01621459.2014.881743 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1174-1187 Template-Type: ReDIF-Article 1.0 Author-Name: Bing Li Author-X-Name-First: Bing Author-X-Name-Last: Li Author-Name: Hyonho Chun Author-X-Name-First: Hyonho Author-X-Name-Last: Chun Author-Name: Hongyu Zhao Author-X-Name-First: Hongyu Author-X-Name-Last: Zhao Title: On an Additive Semigraphoid Model for Statistical Networks With Application to Pathway Analysis Abstract: We introduce a nonparametric method for estimating non-Gaussian graphical models based on a new statistical relation called additive conditional independence, which is a three-way relation among random vectors that resembles the logical structure of conditional independence. Additive conditional independence allows us to use a one-dimensional kernel regardless of the dimension of the graph, which not only avoids the curse of dimensionality but also simplifies computation. It also gives rise to a parallel structure to the Gaussian graphical model that replaces the precision matrix by an additive precision operator. The estimators derived from additive conditional independence cover the recently introduced nonparanormal graphical model as a special case, but outperform it when the Gaussian copula assumption is violated. We compare the new method with existing ones by simulations and in genetic pathway analysis. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1188-1204 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.882842 File-URL: http://hdl.handle.net/10.1080/01621459.2014.882842 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1188-1204 Template-Type: ReDIF-Article 1.0 Author-Name: Yehua Li Author-X-Name-First: Yehua Author-X-Name-Last: Li Author-Name: Yongtao Guan Author-X-Name-First: Yongtao Author-X-Name-Last: Guan Title: Functional Principal Component Analysis of Spatiotemporal Point Processes With Applications in Disease Surveillance Abstract: In disease surveillance applications, the disease events are modeled by spatiotemporal point processes.
We propose a new class of semiparametric generalized linear mixed models for such data, where the event rate is related to some known risk factors and some unknown latent random effects. We model the latent spatiotemporal process as spatially correlated functional data, and propose Poisson maximum likelihood and composite likelihood methods based on spline approximations to estimate the mean and covariance functions of the latent process. By performing functional principal component analysis on the latent process, we can better understand the correlation structure in the point process. We also propose an empirical Bayes method to predict the latent spatial random effects, which can help highlight hot areas with unusually high event rates. Under an increasing domain and increasing knots asymptotic framework, we establish the asymptotic distribution for the parametric components in the model and the asymptotic convergence rates for the functional principal component estimators. We illustrate the methodology through a simulation study and an application to the Connecticut Tumor Registry data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1205-1215 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.885434 File-URL: http://hdl.handle.net/10.1080/01621459.2014.885434 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1205-1215 Template-Type: ReDIF-Article 1.0 Author-Name: Michael Rosenblum Author-X-Name-First: Michael Author-X-Name-Last: Rosenblum Author-Name: Han Liu Author-X-Name-First: Han Author-X-Name-Last: Liu Author-Name: En-Hsu Yen Author-X-Name-First: En-Hsu Author-X-Name-Last: Yen Title: Optimal Tests of Treatment Effects for the Overall Population and Two Subpopulations in Randomized Trials, Using Sparse Linear Programming Abstract: We propose new, optimal methods for analyzing randomized trials, when it is suspected that treatment effects may differ in two predefined subpopulations. Such subpopulations could be defined by a biomarker or risk factor measured at baseline. The goal is to simultaneously learn which subpopulations benefit from an experimental treatment, while providing strong control of the familywise Type I error rate. We formalize this as a multiple testing problem and show it is computationally infeasible to solve using existing techniques. Our solution involves a novel approach, in which we first transform the original multiple testing problem into a large, sparse linear program. We then solve this problem using advanced optimization techniques. This general method can solve a variety of multiple testing problems and decision theory problems related to optimal trial design, for which no solution was previously available. In particular, we construct new multiple testing procedures that satisfy minimax and Bayes optimality criteria. For a given optimality criterion, our new approach yields the optimal tradeoff between power to detect an effect in the overall population versus power to detect effects in subpopulations. We demonstrate our approach in examples motivated by two randomized trials of new treatments for HIV. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 1216-1228 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2013.879063 File-URL: http://hdl.handle.net/10.1080/01621459.2013.879063 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1216-1228 Template-Type: ReDIF-Article 1.0 Author-Name: Shan Luo Author-X-Name-First: Shan Author-X-Name-Last: Luo Author-Name: Zehua Chen Author-X-Name-First: Zehua Author-X-Name-Last: Chen Title: Sequential Lasso Cum EBIC for Feature Selection With Ultra-High Dimensional Feature Space Abstract: In this article, we propose a method called sequential Lasso (SLasso) for feature selection in sparse high-dimensional linear models. The SLasso selects features by sequentially solving partially penalized least squares problems where the features selected in earlier steps are not penalized. The SLasso uses extended BIC (EBIC) as the stopping rule. The procedure stops when EBIC reaches a minimum. The asymptotic properties of SLasso are considered when the dimension of the feature space is ultra high and the number of relevant features diverges. We show that, with probability converging to 1, the SLasso first selects all the relevant features before any irrelevant features can be selected, and that the EBIC decreases until it attains the minimum at the model consisting of exactly all the relevant features and then begins to increase. These results establish the selection consistency of SLasso. The SLasso estimators of the final model are ordinary least squares estimators. The selection consistency implies the oracle property of SLasso. The asymptotic distribution of the SLasso estimators with diverging number of relevant features is provided. The SLasso is compared with other methods by simulation studies, which demonstrates that SLasso is a desirable approach having an edge over the other methods. The SLasso, together with the other methods, is applied to microarray data for mapping disease genes. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1229-1240 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2013.877275 File-URL: http://hdl.handle.net/10.1080/01621459.2013.877275 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1229-1240 Template-Type: ReDIF-Article 1.0 Author-Name: Alexander Aue Author-X-Name-First: Alexander Author-X-Name-Last: Aue Author-Name: Rex C. Y. Cheung Author-X-Name-First: Rex C. Y. Author-X-Name-Last: Cheung Author-Name: Thomas C. M. Lee Author-X-Name-First: Thomas C. M. Author-X-Name-Last: Lee Author-Name: Ming Zhong Author-X-Name-First: Ming Author-X-Name-Last: Zhong Title: Segmented Model Selection in Quantile Regression Using the Minimum Description Length Principle Abstract: This article proposes new model-fitting techniques for quantiles of an observed data sequence, including methods for data segmentation and variable selection. The main contribution, however, is in providing a means to perform these two tasks simultaneously. This is achieved by matching the data with the best-fitting piecewise quantile regression model, where the fit is determined by a penalization derived from the minimum description length principle. The resulting optimization problem is solved with the use of genetic algorithms.
The proposed, fully automatic procedures are, unlike traditional break point procedures, not based on repeated hypothesis tests, and do not require, unlike most variable selection procedures, the specification of a tuning parameter. Theoretical large-sample properties are derived. Empirical comparisons with existing break point and variable selection methods for quantiles indicate that the new procedures work well in practice. Journal: Journal of the American Statistical Association Pages: 1241-1256 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.889022 File-URL: http://hdl.handle.net/10.1080/01621459.2014.889022 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1241-1256 Template-Type: ReDIF-Article 1.0 Author-Name: Chen Xu Author-X-Name-First: Chen Author-X-Name-Last: Xu Author-Name: Jiahua Chen Author-X-Name-First: Jiahua Author-X-Name-Last: Chen Title: The Sparse MLE for Ultrahigh-Dimensional Feature Screening Abstract: Feature selection is fundamental for modeling high-dimensional data, where the number of features can be huge and much larger than the sample size. Since the feature space is so large, many traditional procedures become numerically infeasible. It is hence essential to first remove most apparently noninfluential features before any elaborative analysis. Recently, several procedures have been developed for this purpose, which include sure independence screening (SIS) as a widely used technique. To gain computational efficiency, the SIS screens features based on their individual predicting power. In this article, we propose a new screening method via the sparsity-restricted maximum likelihood estimator (SMLE). The new method naturally takes into account the joint effects of features in the screening process, which gives it an edge to potentially outperform the existing methods. This conjecture is further supported by the simulation studies under a number of modeling settings. We show that the proposed method is screening consistent in the context of ultrahigh-dimensional generalized linear models. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1257-1269 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2013.879531 File-URL: http://hdl.handle.net/10.1080/01621459.2013.879531 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1257-1269 Template-Type: ReDIF-Article 1.0 Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Yunbei Ma Author-X-Name-First: Yunbei Author-X-Name-Last: Ma Author-Name: Wei Dai Author-X-Name-First: Wei Author-X-Name-Last: Dai Title: Nonparametric Independence Screening in Sparse Ultra-High-Dimensional Varying Coefficient Models Abstract: The varying coefficient model is an important class of nonparametric statistical model, which allows us to examine how the effects of covariates vary with exposure variables. When the number of covariates is large, the issue of variable selection arises. In this article, we propose and investigate marginal nonparametric screening methods to screen variables in sparse ultra-high-dimensional varying coefficient models.
The proposed nonparametric independence screening (NIS) selects variables by ranking a measure of the nonparametric marginal contributions of each covariate given the exposure variable. The sure independent screening property is established under some mild technical conditions when the dimensionality is of nonpolynomial order, and the dimensionality reduction of NIS is quantified. To enhance the practical utility and finite sample performance, two data-driven iterative NIS (INIS) methods are proposed for selecting thresholding parameters and variables: conditional permutation and greedy methods, resulting in conditional-INIS and greedy-INIS. The effectiveness and flexibility of the proposed methods are further illustrated by simulation studies and real data applications. Journal: Journal of the American Statistical Association Pages: 1270-1284 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2013.879828 File-URL: http://hdl.handle.net/10.1080/01621459.2013.879828 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1270-1284 Template-Type: ReDIF-Article 1.0 Author-Name: Ning Hao Author-X-Name-First: Ning Author-X-Name-Last: Hao Author-Name: Hao Helen Zhang Author-X-Name-First: Hao Helen Author-X-Name-Last: Zhang Title: Interaction Screening for Ultrahigh-Dimensional Data Abstract: In ultrahigh-dimensional data analysis, it is extremely challenging to identify important interaction effects, and a top concern in practice is computational feasibility. For a dataset with n observations and p predictors, the augmented design matrix including all linear and order-2 terms is of size n × (p² + 3p)/2. When p is large, say more than tens of hundreds, the number of interactions is enormous and beyond the capacity of standard machines and software tools for storage and analysis. In theory, the interaction-selection consistency is hard to achieve in high-dimensional settings. Interaction effects have heavier tails and more complex covariance structures than main effects in a random design, making theoretical analysis difficult. In this article, we propose to tackle these issues by forward-selection-based procedures called iFOR, which identify interaction effects in a greedy forward fashion while maintaining the natural hierarchical model structure. Two algorithms, iFORT and iFORM, are studied. Computationally, the iFOR procedures are designed to be simple and fast to implement. No complex optimization tools are needed, since only OLS-type calculations are involved; the iFOR algorithms avoid storing and manipulating the whole augmented matrix, so the memory and CPU requirement is minimal; the computational complexity is linear in p for sparse models, hence feasible for p >> n. Theoretically, we prove that they possess sure screening property for ultrahigh-dimensional settings. Numerical examples are used to demonstrate their finite sample performance. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1285-1301 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.881741 File-URL: http://hdl.handle.net/10.1080/01621459.2014.881741 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1285-1301 Template-Type: ReDIF-Article 1.0 Author-Name: Xiaofeng Shao Author-X-Name-First: Xiaofeng Author-X-Name-Last: Shao Author-Name: Jingsi Zhang Author-X-Name-First: Jingsi Author-X-Name-Last: Zhang Title: Martingale Difference Correlation and Its Use in High-Dimensional Variable Screening Abstract: In this article, we propose a new metric, the so-called martingale difference correlation, to measure the departure of conditional mean independence between a scalar response variable V and a vector predictor variable U. Our metric is a natural extension of distance correlation proposed by Székely, Rizzo, and Bakirov, which is used to measure the dependence between V and U. The martingale difference correlation and its empirical counterpart inherit a number of desirable features of distance correlation and sample distance correlation, such as algebraic simplicity and elegant theoretical properties. We further use martingale difference correlation as a marginal utility to do high-dimensional variable screening to screen out variables that do not contribute to conditional mean of the response given the covariates. Further extension to conditional quantile screening is also described in detail and sure screening properties are rigorously justified. Both simulation results and real data illustrations demonstrate the effectiveness of martingale difference correlation-based screening procedures in comparison with the existing counterparts. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1302-1318 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.887012 File-URL: http://hdl.handle.net/10.1080/01621459.2014.887012 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1302-1318 Template-Type: ReDIF-Article 1.0 Author-Name: Holger Dette Author-X-Name-First: Holger Author-X-Name-Last: Dette Author-Name: Ria Van Hecke Author-X-Name-First: Ria Author-X-Name-Last: Van Hecke Author-Name: Stanislav Volgushev Author-X-Name-First: Stanislav Author-X-Name-Last: Volgushev Title: Some Comments on Copula-Based Regression Abstract: In a recent article, Noh, El Ghouch, and Bouezmarni proposed a new semiparametric estimate of a regression function with a multivariate predictor, which is based on a specification of the dependence structure between the predictor and the response by means of a parametric copula. This comment investigates the effect which occurs under misspecification of the parametric model. We demonstrate by means of several examples that even for a one- or two-dimensional predictor the error caused by a "wrong" specification of the parametric family is rather severe, if the regression is not monotone in one of the components of the predictor. Moreover, we also show that these problems occur for all of the commonly used copula families and we illustrate in several examples that the copula-based regression may lead to invalid results even when flexible copula models such as vine copulas (with the common parametric families) are used in the estimation procedure. Journal: Journal of the American Statistical Association Pages: 1319-1324 Issue: 507 Volume: 109 Year: 2014 Month: 9 X-DOI: 10.1080/01621459.2014.916577 File-URL: http://hdl.handle.net/10.1080/01621459.2014.916577 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1319-1324 Template-Type: ReDIF-Article 1.0 Author-Name: Michael R. Wierzbicki Author-X-Name-First: Michael R. Author-X-Name-Last: Wierzbicki Author-Name: Li-Bing Guo Author-X-Name-First: Li-Bing Author-X-Name-Last: Guo Author-Name: Qing-Tao Du Author-X-Name-First: Qing-Tao Author-X-Name-Last: Du Author-Name: Wensheng Guo Author-X-Name-First: Wensheng Author-X-Name-Last: Guo Title: Sparse Semiparametric Nonlinear Model With Application to Chromatographic Fingerprints Abstract: Traditional Chinese herbal medications (TCHMs) are composed of a multitude of compounds and the identification of their active composition is an important area of research. Chromatography provides a visual representation of a TCHM sample's composition by outputting a curve characterized by spikes corresponding to compounds in the sample. Across different experimental conditions, the location of the spikes can be shifted, preventing direct comparison of curves and forcing compound identification to be possible only within each experiment. In this article, we propose a sparse semiparametric nonlinear modeling framework for the establishment of a standardized chromatographic fingerprint. Data-driven basis expansion is used to model the common shape of the curves, while a parametric time warping function registers across individual curves. Penalized weighted least-squares with the adaptive lasso penalty provides a unified criterion for registration, model selection, and estimation. Furthermore, the adaptive lasso estimators possess attractive sampling properties. A back-fitting algorithm is proposed for estimation. Performance is assessed through simulation and we apply the model to chromatographic data of rhubarb collected from different experimental conditions and establish a standardized fingerprint as a first step in TCHM research. Journal: Journal of the American Statistical Association Pages: 1339-1349 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2013.836969 File-URL: http://hdl.handle.net/10.1080/01621459.2013.836969 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1339-1349 Template-Type: ReDIF-Article 1.0 Author-Name: Pang Du Author-X-Name-First: Pang Author-X-Name-Last: Du Title: Comment Journal: Journal of the American Statistical Association Pages: 1349-1350 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.926686 File-URL: http://hdl.handle.net/10.1080/01621459.2014.926686 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1349-1350 Template-Type: ReDIF-Article 1.0 Author-Name: Huaihou Chen Author-X-Name-First: Huaihou Author-X-Name-Last: Chen Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Title: Comment Journal: Journal of the American Statistical Association Pages: 1350-1353 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.972158 File-URL: http://hdl.handle.net/10.1080/01621459.2014.972158 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1350-1353 Template-Type: ReDIF-Article 1.0 Author-Name: Michael R. Wierzbicki Author-X-Name-First: Michael R. 
Author-X-Name-Last: Wierzbicki Author-Name: Li-Bing Guo Author-X-Name-First: Li-Bing Author-X-Name-Last: Guo Author-Name: Qing-Tao Du Author-X-Name-First: Qing-Tao Author-X-Name-Last: Du Author-Name: Wensheng Guo Author-X-Name-First: Wensheng Author-X-Name-Last: Guo Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 1353-1354 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.972161 File-URL: http://hdl.handle.net/10.1080/01621459.2014.972161 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1353-1354 Template-Type: ReDIF-Article 1.0 Author-Name: Yoonsuh Jung Author-X-Name-First: Yoonsuh Author-X-Name-Last: Jung Author-Name: Jianhua Z. Huang Author-X-Name-First: Jianhua Z. Author-X-Name-Last: Huang Author-Name: Jianhua Hu Author-X-Name-First: Jianhua Author-X-Name-Last: Hu Title: Biomarker Detection in Association Studies: Modeling SNPs Simultaneously via Logistic ANOVA Abstract: In genome-wide association studies, the primary task is to detect biomarkers in the form of single nucleotide polymorphisms (SNPs) that have nontrivial associations with a disease phenotype and some other important clinical/environmental factors. However, the extremely large number of SNPs compared to the sample size inhibits application of classical methods such as the multiple logistic regression. Currently, the most commonly used approach is still to analyze one SNP at a time. In this article, we propose to consider the genotypes of the SNPs simultaneously via a logistic analysis of variance (ANOVA) model, which expresses the logit transformed mean of SNP genotypes as the summation of the SNP effects, effects of the disease phenotype and/or other clinical variables, and the interaction effects. We use a reduced-rank representation of the interaction-effect matrix for dimensionality reduction, and employ the L1-penalty in a penalized likelihood framework to filter out the SNPs that have no associations. We develop a majorization-minimization algorithm for computational implementation. In addition, we propose a modified BIC criterion to select the penalty parameters and determine the rank number. The proposed method is applied to a multiple sclerosis dataset and simulated datasets and shows promise in biomarker detection. Journal: Journal of the American Statistical Association Pages: 1355-1367 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.928217 File-URL: http://hdl.handle.net/10.1080/01621459.2014.928217 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1355-1367 Template-Type: ReDIF-Article 1.0 Author-Name: Jonathan R. Stroud Author-X-Name-First: Jonathan R. Author-X-Name-Last: Stroud Author-Name: Michael S. Johannes Author-X-Name-First: Michael S. Author-X-Name-Last: Johannes Title: Bayesian Modeling and Forecasting of 24-Hour High-Frequency Volatility Abstract: This article estimates models of high-frequency index futures returns using "around-the-clock" 5-min returns that incorporate the following key features: multiple persistent stochastic volatility factors, jumps in prices and volatilities, seasonal components capturing time of the day patterns, correlations between return and volatility shocks, and announcement effects. 
We develop an integrated MCMC approach to estimate interday and intraday parameters and states using high-frequency data without resorting to various aggregation measures like realized volatility. We provide a case study using financial crisis data from 2007 to 2009, and use particle filters to construct likelihood functions for model comparison and out-of-sample forecasting from 2009 to 2012. We show that our approach improves realized volatility forecasts by up to 50% over existing benchmarks and is also useful for risk management and trading applications. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1368-1384 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.937003 File-URL: http://hdl.handle.net/10.1080/01621459.2014.937003 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1368-1384 Template-Type: ReDIF-Article 1.0 Author-Name: Dimitris Rizopoulos Author-X-Name-First: Dimitris Author-X-Name-Last: Rizopoulos Author-Name: Laura A. Hatfield Author-X-Name-First: Laura A. Author-X-Name-Last: Hatfield Author-Name: Bradley P. Carlin Author-X-Name-First: Bradley P. Author-X-Name-Last: Carlin Author-Name: Johanna J. M. Takkenberg Author-X-Name-First: Johanna J. M. Author-X-Name-Last: Takkenberg Title: Combining Dynamic Predictions From Joint Models for Longitudinal and Time-to-Event Data Using Bayesian Model Averaging Abstract: The joint modeling of longitudinal and time-to-event data is an active area of statistics research that has received a lot of attention in recent years. More recently, a new and attractive application of this type of model has been to obtain individualized predictions of survival probabilities and/or of future longitudinal responses. The advantageous feature of these predictions is that they are dynamically updated as extra longitudinal responses are collected for the subjects of interest, providing real time risk assessment using all recorded information. The aim of this article is two-fold. First, to highlight the importance of modeling the association structure between the longitudinal and event time responses that can greatly influence the derived predictions, and second, to illustrate how we can improve the accuracy of the derived predictions by suitably combining joint models with different association structures. The second goal is achieved using Bayesian model averaging, which, in this setting, has the very intriguing feature that the model weights are not fixed but they are rather subject- and time-dependent, implying that at different follow-up times predictions for the same subject may be based on different models. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1385-1397 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.931236 File-URL: http://hdl.handle.net/10.1080/01621459.2014.931236 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1385-1397 Template-Type: ReDIF-Article 1.0 Author-Name: Marian Farah Author-X-Name-First: Marian Author-X-Name-Last: Farah Author-Name: Paul Birrell Author-X-Name-First: Paul Author-X-Name-Last: Birrell Author-Name: Stefano Conti Author-X-Name-First: Stefano Author-X-Name-Last: Conti Author-Name: Daniela De Angelis Author-X-Name-First: Daniela Author-X-Name-Last: De Angelis Title: Bayesian Emulation and Calibration of a Dynamic Epidemic Model for A/H1N1 Influenza Abstract: In this article, we develop a Bayesian framework for parameter estimation of a computationally expensive dynamic epidemic model using time series epidemic data. Specifically, we work with a model for A/H1N1 influenza, which is implemented as a deterministic computer simulator, taking as input the underlying epidemic parameters and calculating the corresponding time series of reported infections. To obtain Bayesian inference for the epidemic parameters, the simulator is embedded in the likelihood for the reported epidemic data. However, the simulator is computationally slow, making it impractical to use in Bayesian estimation where a large number of simulator runs is required. We propose an efficient approximation to the simulator using an emulator, a statistical model that combines a Gaussian process (GP) prior for the output function of the simulator with a dynamic linear model (DLM) for its evolution through time. This modeling framework is both flexible and tractable, resulting in efficient posterior inference through Markov chain Monte Carlo (MCMC). The proposed dynamic emulator is then used in a calibration procedure to obtain posterior inference for the parameters of the influenza epidemic. Journal: Journal of the American Statistical Association Pages: 1398-1411 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.934453 File-URL: http://hdl.handle.net/10.1080/01621459.2014.934453 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1398-1411 Template-Type: ReDIF-Article 1.0 Author-Name: Hui Huang Author-X-Name-First: Hui Author-X-Name-Last: Huang Author-Name: Yehua Li Author-X-Name-First: Yehua Author-X-Name-Last: Li Author-Name: Yongtao Guan Author-X-Name-First: Yongtao Author-X-Name-Last: Guan Title: Joint Modeling and Clustering Paired Generalized Longitudinal Trajectories With Application to Cocaine Abuse Treatment Data Abstract: In a cocaine dependence treatment study, we have paired binary longitudinal trajectories that record the cocaine use patterns of each patient before and after a treatment. To better understand the drug-using behaviors among the patients, we propose a general framework based on functional data analysis to jointly model and cluster these paired non-Gaussian longitudinal trajectories. Our approach assumes that the response variables follow distributions from the exponential family, with the canonical parameters determined by some latent Gaussian processes. To reduce the dimensionality of the latent processes, we express them by a truncated Karhunen-Loève (KL) expansion allowing the mean and covariance functions to be different across clusters. We further represent the mean and eigenfunctions by flexible spline bases, and determine the orders of the truncated KL expansions using data-driven methods. By treating the cluster membership as a missing value, we cluster the cocaine use trajectories by a likelihood-based approach.
The cluster membership and parameter estimates are jointly estimated by a Monte Carlo EM algorithm with Gibbs sampling steps. We discover subgroups of patients with distinct behaviors in terms of overall probability to use, binge versus periodic use patterns, etc. The joint modeling approach also sheds new light on relating relapse behavior to baseline pattern in each subgroup. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1412-1424 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.957286 File-URL: http://hdl.handle.net/10.1080/01621459.2014.957286 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1412-1424 Template-Type: ReDIF-Article 1.0 Author-Name: Jonathan E. Gellar Author-X-Name-First: Jonathan E. Author-X-Name-Last: Gellar Author-Name: Elizabeth Colantuoni Author-X-Name-First: Elizabeth Author-X-Name-Last: Colantuoni Author-Name: Dale M. Needham Author-X-Name-First: Dale M. Author-X-Name-Last: Needham Author-Name: Ciprian M. Crainiceanu Author-X-Name-First: Ciprian M. Author-X-Name-Last: Crainiceanu Title: Variable-Domain Functional Regression for Modeling ICU Data Abstract: We introduce a class of scalar-on-function regression models with subject-specific functional predictor domains. The fundamental idea is to consider a bivariate functional parameter that depends both on the functional argument and on the width of the functional predictor domain. Both parametric and nonparametric models are introduced to fit the functional coefficient. The nonparametric model is theoretically and practically invariant to functional support transformation, or support registration. Methods were motivated by and applied to a study of association between daily measures of the Intensive Care Unit (ICU) sequential organ failure assessment (SOFA) score and two outcomes: in-hospital mortality, and physical impairment at hospital discharge among survivors. Methods are generally applicable to a large number of new studies that record continuous variables over unequal domains. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1425-1439 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.940044 File-URL: http://hdl.handle.net/10.1080/01621459.2014.940044 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1425-1439 Template-Type: ReDIF-Article 1.0 Author-Name: Daniel J. Graham Author-X-Name-First: Daniel J. Author-X-Name-Last: Graham Author-Name: Emma J. McCoy Author-X-Name-First: Emma J. Author-X-Name-Last: McCoy Author-Name: David A. Stephens Author-X-Name-First: David A. Author-X-Name-Last: Stephens Title: Quantifying Causal Effects of Road Network Capacity Expansions on Traffic Volume and Density via a Mixed Model Propensity Score Estimator Abstract: Road network capacity expansions are frequently proposed as solutions to urban traffic congestion but are controversial because it is thought that they can directly "induce" growth in traffic volumes. This article quantifies causal effects of road network capacity expansions on aggregate urban traffic volume and density in U.S. cities using a mixed model propensity score (PS) estimator.
The motivation for this approach is that we seek to estimate a dose-response relationship between capacity and volume but suspect confounding from both observed and unobserved characteristics. Analytical results and simulations show that a longitudinal mixed model PS approach can be used to adjust effectively for time-invariant unobserved confounding via random effects (RE). Our empirical results indicate that network capacity expansions can cause substantial increases in aggregate urban traffic volumes such that even major capacity increases can actually lead to little or no reduction in network traffic densities. This result has important implications for optimal urban transportation strategies. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1440-1449 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.956871 File-URL: http://hdl.handle.net/10.1080/01621459.2014.956871 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1440-1449 Template-Type: ReDIF-Article 1.0 Author-Name: Dungang Liu Author-X-Name-First: Dungang Author-X-Name-Last: Liu Author-Name: Regina Y. Liu Author-X-Name-First: Regina Y. Author-X-Name-Last: Liu Author-Name: Min-ge Xie Author-X-Name-First: Min-ge Author-X-Name-Last: Xie Title: Exact Meta-Analysis Approach for Discrete Data and its Application to 2 × 2 Tables With Rare Events Abstract: This article proposes a general exact meta-analysis approach for synthesizing inferences from multiple studies of discrete data. The approach combines the p-value functions (also known as significance functions) associated with the exact tests from individual studies. It encompasses a broad class of exact meta-analysis methods, as it permits broad choices for the combining elements, such as tests used in individual studies, and any parameter of interest. The approach yields statements that explicitly account for the impact of individual studies on the overall inference, in terms of efficiency/power and the Type I error rate. Those statements also give rise to empirical methods for further enhancing the combined inference. Although the proposed approach is for general discrete settings, for convenience, it is illustrated throughout using the setting of meta-analysis of multiple 2 × 2 tables. In the context of rare events data, such as observing few, zero, or zero total (i.e., zero events in both arms) outcomes in binomial trials or 2 × 2 tables, most existing meta-analysis methods rely on large-sample approximations, which may yield invalid inference. The commonly used corrections to zero outcomes in rare events data, aiming to improve numerical performance, can also incur undesirable consequences. The proposed approach applies readily to any rare event setting, including even the zero total event studies without any artificial correction. While debates continue on whether or how zero total event studies should be incorporated in meta-analysis, the proposed approach has the advantage of automatically including those studies and thus making use of all available data. Through numerical studies in rare events settings, the proposed exact approach is shown to be efficient and, generally, outperform commonly used meta-analysis methods, including Mantel-Haenszel and Peto methods.
Journal: Journal of the American Statistical Association Pages: 1450-1465 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.946318 File-URL: http://hdl.handle.net/10.1080/01621459.2014.946318 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1450-1465 Template-Type: ReDIF-Article 1.0 Author-Name: Edoardo M. Airoldi Author-X-Name-First: Edoardo M. Author-X-Name-Last: Airoldi Author-Name: Thiago Costa Author-X-Name-First: Thiago Author-X-Name-Last: Costa Author-Name: Federico Bassetti Author-X-Name-First: Federico Author-X-Name-Last: Bassetti Author-Name: Fabrizio Leisen Author-X-Name-First: Fabrizio Author-X-Name-Last: Leisen Author-Name: Michele Guindani Author-X-Name-First: Michele Author-X-Name-Last: Guindani Title: Generalized Species Sampling Priors With Latent Beta Reinforcements Abstract: Many popular Bayesian nonparametric priors can be characterized in terms of exchangeable species sampling sequences. However, in some applications, exchangeability may not be appropriate. We introduce a novel and probabilistically coherent family of nonexchangeable species sampling sequences characterized by a tractable predictive probability function with weights driven by a sequence of independent Beta random variables. We compare their theoretical clustering properties with those of the Dirichlet process and the two-parameter Poisson-Dirichlet process. The proposed construction provides a complete characterization of the joint process, differently from existing work. We then propose the use of such process as prior distribution in a hierarchical Bayes modeling framework, and we describe a Markov chain Monte Carlo sampler for posterior inference. We evaluate the performance of the prior and the robustness of the resulting inference in a simulation study, providing a comparison with popular Dirichlet process mixtures and hidden Markov models. Finally, we develop an application to the detection of chromosomal aberrations in breast cancer by leveraging array comparative genomic hybridization (CGH) data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1466-1480 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.950735 File-URL: http://hdl.handle.net/10.1080/01621459.2014.950735 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1466-1480 Template-Type: ReDIF-Article 1.0 Author-Name: Kelvin Gu Author-X-Name-First: Kelvin Author-X-Name-Last: Gu Author-Name: Debdeep Pati Author-X-Name-First: Debdeep Author-X-Name-Last: Pati Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: Bayesian Multiscale Modeling of Closed Curves in Point Clouds Abstract: Modeling object boundaries based on image or point cloud data is frequently necessary in medical and scientific applications ranging from detecting tumor contours for targeted radiation therapy, to the classification of organisms based on their structural information. In low-contrast images or sparse and noisy point clouds, there is often insufficient data to recover local segments of the boundary in isolation. Thus, it becomes critical to model the entire boundary in the form of a closed curve. To achieve this, we develop a Bayesian hierarchical model that expresses highly diverse 2D objects in the form of closed curves.
The model is based on a novel multiscale deformation process. By relating multiple objects through a hierarchical formulation, we can successfully recover missing boundaries by borrowing structural information from similar objects at the appropriate scale. Furthermore, the model's latent parameters help interpret the population, indicating dimensions of significant structural variability and also specifying a "central curve" that summarizes the collection. Theoretical properties of our prior are studied in specific cases and efficient Markov chain Monte Carlo methods are developed, evaluated through simulation examples and applied to panorex teeth images for modeling teeth contours and also to a brain tumor contour detection problem. Journal: Journal of the American Statistical Association Pages: 1481-1494 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.934825 File-URL: http://hdl.handle.net/10.1080/01621459.2014.934825 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1481-1494 Template-Type: ReDIF-Article 1.0 Author-Name: Qing Zhou Author-X-Name-First: Qing Author-X-Name-Last: Zhou Title: Monte Carlo Simulation for Lasso-Type Problems by Estimator Augmentation Abstract: Regularized linear regression under the ℓ1 penalty, such as the Lasso, has been shown to be effective in variable selection and sparse modeling. The sampling distribution of an ℓ1-penalized estimator is hard to determine as the estimator is defined by an optimization problem that in general can only be solved numerically and many of its components may be exactly zero. Let S be the subgradient of the ℓ1 norm of the coefficient vector β evaluated at the estimator β̂. We find that the joint sampling distribution of β̂ and S, together called an augmented estimator, is much more tractable and has a closed-form density under a normal error distribution in both low-dimensional (p ≤ n) and high-dimensional (p > n) settings. Given β and the error variance σ², one may employ standard Monte Carlo methods, such as Markov chain Monte Carlo (MCMC) and importance sampling (IS), to draw samples from the distribution of the augmented estimator and calculate expectations with respect to the sampling distribution of β̂. We develop a few concrete Monte Carlo algorithms and demonstrate with numerical examples that our approach may offer huge advantages and great flexibility in studying sampling distributions in ℓ1-penalized linear regression. We also establish nonasymptotic bounds on the difference between the true sampling distribution of β̂ and its estimator obtained by plugging in estimated parameters, which justifies the validity of Monte Carlo simulation from an estimated sampling distribution even when p >> n → ∞. Journal: Journal of the American Statistical Association Pages: 1495-1516 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.946035 File-URL: http://hdl.handle.net/10.1080/01621459.2014.946035 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1495-1516 Template-Type: ReDIF-Article 1.0 Author-Name: Lu Tian Author-X-Name-First: Lu Author-X-Name-Last: Tian Author-Name: Ash A. Alizadeh Author-X-Name-First: Ash A. Author-X-Name-Last: Alizadeh Author-Name: Andrew J. Gentles Author-X-Name-First: Andrew J.
Author-X-Name-Last: Gentles Author-Name: Robert Tibshirani Author-X-Name-First: Robert Author-X-Name-Last: Tibshirani Title: A Simple Method for Estimating Interactions Between a Treatment and a Large Number of Covariates Abstract: We consider a setting in which we have a treatment and a potentially large number of covariates for a set of observations, and wish to model their relationship with an outcome of interest. We propose a simple method for modeling interactions between the treatment and covariates. The idea is to modify the covariates in a simple way, and then fit a standard model using the modified covariates and no main effects. We show that coupled with an efficiency augmentation procedure, this method produces clinically meaningful estimators in a variety of settings. It can be useful for practicing personalized medicine: determining, from a large set of biomarkers, the subset of patients that can potentially benefit from a treatment. We apply the method to both simulated datasets and real trial data. The modified covariates idea can be used for other purposes, for example, large scale hypothesis testing for determining which of a set of covariates interact with a treatment variable. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1517-1532 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.951443 File-URL: http://hdl.handle.net/10.1080/01621459.2014.951443 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1517-1532 Template-Type: ReDIF-Article 1.0 Author-Name: Y. J. Hu Author-X-Name-First: Y. J. Author-X-Name-Last: Hu Author-Name: D. Y. Lin Author-X-Name-First: D. Y. Author-X-Name-Last: Lin Author-Name: W. Sun Author-X-Name-First: W. Author-X-Name-Last: Sun Author-Name: D. Zeng Author-X-Name-First: D. Author-X-Name-Last: Zeng Title: A Likelihood-Based Framework for Association Analysis of Allele-Specific Copy Numbers Abstract: Copy number variants (CNVs) and single nucleotide polymorphisms (SNPs) coexist throughout the human genome and jointly contribute to phenotypic variations. Thus, it is desirable to consider both types of variants, as characterized by allele-specific copy numbers (ASCNs), in association studies of complex human diseases. Current SNP genotyping technologies capture the CNV and SNP information simultaneously via fluorescent intensity measurements. The common practice of calling ASCNs from the intensity measurements and then using the ASCN calls in downstream association analysis has important limitations. First, the association tests are prone to false-positive findings when differential measurement errors between cases and controls arise from differences in DNA quality or handling. Second, the uncertainties in the ASCN calls are ignored. We present a general framework for the integrated analysis of CNVs and SNPs, including the analysis of total copy numbers as a special case. Our approach combines the ASCN calling and the association analysis into a single step while allowing for differential measurement errors. We construct likelihood functions that properly account for case-control sampling and measurement errors. We establish the asymptotic properties of the maximum likelihood estimators and develop EM algorithms to implement the corresponding inference procedures.
The advantages of the proposed methods over the existing ones are demonstrated through realistic simulation studies and an application to a genome-wide association study of schizophrenia. Extensions to next-generation sequencing data are discussed. Journal: Journal of the American Statistical Association Pages: 1533-1545 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.908777 File-URL: http://hdl.handle.net/10.1080/01621459.2014.908777 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1533-1545 Template-Type: ReDIF-Article 1.0 Author-Name: Zudi Lu Author-X-Name-First: Zudi Author-X-Name-Last: Lu Author-Name: Dag Tjøstheim Author-X-Name-First: Dag Author-X-Name-Last: Tjøstheim Title: Nonparametric Estimation of Probability Density Functions for Irregularly Observed Spatial Data Abstract: Nonparametric estimation of probability density functions, both marginal and joint densities, is a very useful tool in statistics. The kernel method is popular and applicable to dependent data, including time series and spatial data. But at least for the joint density, one has had to assume that data are observed at regular time intervals or on a regular grid in space. Though this is not very restrictive in the time series case, it often is in the spatial case. In fact, to a large degree it has precluded applications of nonparametric methods to spatial data because such data often are irregularly positioned over space. In this article, we propose nonparametric kernel estimators for both the marginal and in particular the joint probability density functions for nongridded spatial data. Large sample distributions of the proposed estimators are established under mild conditions, and a new framework of expanding-domain infill asymptotics is suggested to overcome the shortcomings of spatial asymptotics in the existing literature. A practical, reasonable selection of the bandwidths on the basis of cross-validation is also proposed. We demonstrate by both simulations and real data examples of moderate sample size that the proposed methodology is effective and useful in uncovering nonlinear spatial dependence for general, including non-Gaussian, distributions. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1546-1564 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.947376 File-URL: http://hdl.handle.net/10.1080/01621459.2014.947376 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1546-1564 Template-Type: ReDIF-Article 1.0 Author-Name: Fangpo Wang Author-X-Name-First: Fangpo Author-X-Name-Last: Wang Author-Name: Alan E. Gelfand Author-X-Name-First: Alan E. Author-X-Name-Last: Gelfand Title: Modeling Space and Space-Time Directional Data Using Projected Gaussian Processes Abstract: Directional data naturally arise in many scientific fields, such as oceanography (wave direction), meteorology (wind direction), and biology (animal movement direction). Our contribution is to develop a fully model-based approach to capture structured spatial dependence for modeling directional data at different spatial locations. We build a projected Gaussian spatial process, induced from a bivariate Gaussian spatial process.
We discuss the properties of the projected Gaussian process and show how to fit this process as a model for data, using suitable latent variables, with Markov chain Monte Carlo methods. We also show how to implement spatial interpolation and conduct model comparison in this setting. Simulated examples are provided as proof of concept. A data application arises for modeling wave direction data in the Adriatic sea, off the coast of Italy. In fact, this directional data is available across time, requiring a spatio-temporal model for its analysis. We discuss and illustrate this extension. Journal: Journal of the American Statistical Association Pages: 1565-1580 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.934454 File-URL: http://hdl.handle.net/10.1080/01621459.2014.934454 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1565-1580 Template-Type: ReDIF-Article 1.0 Author-Name: Matthew Plumlee Author-X-Name-First: Matthew Author-X-Name-Last: Plumlee Title: Fast Prediction of Deterministic Functions Using Sparse Grid Experimental Designs Abstract: Random field models have been widely employed to develop a predictor of an expensive function based on observations from an experiment. The traditional framework for developing a predictor with random field models can fail due to the computational burden it requires. This problem is often seen in cases where the input of the expensive function is high dimensional. While many previous works have focused on developing an approximative predictor to resolve these issues, this article investigates a different solution mechanism. We demonstrate that when a general set of designs is employed, the resulting predictor is quick to compute and has reasonable accuracy. The fast computation of the predictor is made possible through an algorithm proposed by this work. This article also demonstrates methods to quickly evaluate the likelihood of the observations and describes some fast maximum likelihood estimates for unknown parameters of the random field. The computational savings can be several orders of magnitude when the input is located in a high-dimensional space. Beyond the fast computation of the predictor, existing research has demonstrated that a subset of these designs generate predictors that are asymptotically efficient. This work details some empirical comparisons to the more common space-filling designs that verify the designs are competitive in terms of resulting prediction accuracy. Journal: Journal of the American Statistical Association Pages: 1581-1591 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.900250 File-URL: http://hdl.handle.net/10.1080/01621459.2014.900250 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1581-1591 Template-Type: ReDIF-Article 1.0 Author-Name: Bradley Jones Author-X-Name-First: Bradley Author-X-Name-Last: Jones Author-Name: Dibyen Majumdar Author-X-Name-First: Dibyen Author-X-Name-Last: Majumdar Title: Optimal Supersaturated Designs Abstract: We consider screening experiments where an investigator wishes to study many factors using fewer observations. Our focus is on experiments with two-level factors and a main effects model with intercept. Since the number of parameters is larger than the number of observations, traditional methods of inference and design are unavailable. 
In 1959, Box suggested the use of supersaturated designs, and in 1962, Booth and Cox introduced measures for efficiency of these designs, including E(s²), which is the average of squares of the off-diagonal entries of the information matrix, ignoring the intercept. For a design to be E(s²)-optimal, the main effect of every factor must be orthogonal to the intercept (factors are balanced), and among all designs that satisfy this condition, it should minimize E(s²). This is a natural approach since it identifies the most nearly orthogonal design, and orthogonal designs enjoy many desirable properties including efficient parameter estimation. Factor balance in an E(s²)-optimal design has the consequence that the intercept is the most precisely estimated parameter. We introduce and study UE(s²)-optimality, which is essentially the same as E(s²)-optimality, except that we do not insist on factor balance. We also provide a method of construction. We introduce a second criterion from a traditional design optimality theory viewpoint. We use minimization of bias as our estimation criterion, and minimization of the variance of the minimum bias estimator as the design optimality criterion. Using D-optimality as the specific design optimality criterion, we introduce D-optimal supersaturated designs. We show that D-optimal supersaturated designs can be constructed from D-optimal chemical balance weighing designs obtained by Galil and Kiefer (1980, 1982), Cheng (1980), and other authors. It turns out that, except when the number of observations and the number of factors are in a certain range, a UE(s²)-optimal design is also a D-optimal supersaturated design. Moreover, these designs have an interesting connection to Bayes optimal designs. When the prior variance is large enough, a D-optimal supersaturated design is Bayes D-optimal, and when the prior variance is small enough, a UE(s²)-optimal design is Bayes D-optimal. While E(s²)-optimal designs yield precise intercept estimates, our study indicates that UE(s²)-optimal designs generally produce more efficient estimates for the main effects of the factors. Based on theoretical properties and the study of examples, we recommend UE(s²)-optimal designs for screening experiments. Journal: Journal of the American Statistical Association Pages: 1592-1600 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.938810 File-URL: http://hdl.handle.net/10.1080/01621459.2014.938810 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1592-1600 Template-Type: ReDIF-Article 1.0 Author-Name: Alberto Abadie Author-X-Name-First: Alberto Author-X-Name-Last: Abadie Author-Name: Guido W. Imbens Author-X-Name-First: Guido W. Author-X-Name-Last: Imbens Author-Name: Fanyin Zheng Author-X-Name-First: Fanyin Author-X-Name-Last: Zheng Title: Inference for Misspecified Models With Fixed Regressors Abstract: Following the work by Eicker, Huber, and White, it is common in empirical work to report standard errors that are robust against general misspecification. In a regression setting, these standard errors are valid for the parameter that minimizes the squared difference between the conditional expectation and a linear approximation, averaged over the population distribution of the covariates.
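Because the Jones and Majumdar abstract above defines E(s²) explicitly, as the average of the squares of the off-diagonal entries of the information matrix with the intercept ignored, a few lines of Python make the criterion concrete. The random candidate design and the function name below are illustrative only; no optimality search is attempted.

import numpy as np

def e_s2(X):
    # E(s^2): average squared off-diagonal entry of X'X, intercept excluded.
    # X is an n-by-k matrix of +/-1 factor settings (no intercept column).
    S = X.T @ X
    k = S.shape[0]
    off = S[~np.eye(k, dtype=bool)]       # all off-diagonal entries of X'X
    return np.mean(off.astype(float) ** 2)

rng = np.random.default_rng(1)
n, k = 12, 20                             # supersaturated: more factors than runs
X = rng.choice([-1, 1], size=(n, k))
print(e_s2(X))                            # smaller means closer to orthogonal

For E(s²)-optimality each column of X must additionally sum to zero (factor balance); UE(s²)-optimality, as the abstract notes, drops that balance requirement and minimizes the same quantity over all candidate designs.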
Here, we discuss an alternative parameter that corresponds to the approximation to the conditional expectation based on minimization of the squared difference averaged over the sample, rather than the population, distribution of the covariates. We argue that in some cases this may be a more interesting parameter. We derive the asymptotic variance for this parameter, which is generally smaller than the Eicker-Huber-White robust variance, and propose a consistent estimator for this asymptotic variance. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1601-1614 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.928218 File-URL: http://hdl.handle.net/10.1080/01621459.2014.928218 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1601-1614 Template-Type: ReDIF-Article 1.0 Author-Name: Michael Rosenthal Author-X-Name-First: Michael Author-X-Name-Last: Rosenthal Author-Name: Wei Wu Author-X-Name-First: Wei Author-X-Name-Last: Wu Author-Name: Eric Klassen Author-X-Name-First: Eric Author-X-Name-Last: Klassen Author-Name: Anuj Srivastava Author-X-Name-First: Anuj Author-X-Name-Last: Srivastava Title: Spherical Regression Models Using Projective Linear Transformations Abstract: This article studies the problem of modeling the relationship between two spherical (or directional) random variables in a regression setup. Here the predictor and the response variables are constrained to be on a unit sphere and, due to this nonlinear condition, the standard Euclidean regression models do not apply. Several past papers have studied this problem, termed spherical regression, by modeling the response variable with a von Mises-Fisher (VMF) density with the mean given by a rotation of the predictor variable. The few papers that go beyond rigid rotations are limited to one- or two-dimensional spheres. This article extends the mean transformations to a larger group--the projective linear group of transformations--on unit spheres of arbitrary dimensions, while keeping the VMF density to model the noise. It develops a Newton-Raphson algorithm on the special linear group for estimating the MLE of the regression parameter and establishes its asymptotic properties when the sample size becomes large. Through a variety of experiments, using data taken from projective shape analysis, cloud tracking, etc., and some simulations, this article demonstrates improvements in the prediction and modeling performance of the proposed framework over previously used models. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1615-1624 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.892881 File-URL: http://hdl.handle.net/10.1080/01621459.2014.892881 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1615-1624 Template-Type: ReDIF-Article 1.0 Author-Name: Jing Ning Author-X-Name-First: Jing Author-X-Name-Last: Ning Author-Name: Jing Qin Author-X-Name-First: Jing Author-X-Name-Last: Qin Author-Name: Yu Shen Author-X-Name-First: Yu Author-X-Name-Last: Shen Title: Score Estimating Equations from Embedded Likelihood Functions Under Accelerated Failure Time Model Abstract: The semiparametric accelerated failure time (AFT) model is one of the most popular models for analyzing time-to-event outcomes. One appealing feature of the AFT model is that the observed failure time data can be transformed to independent and identically distributed random variables without covariate effects. We describe a class of estimating equations based on the score functions for the transformed data, which are derived from the full likelihood function under commonly used semiparametric models such as the proportional hazards or proportional odds model. The methods of estimating regression parameters under the AFT model can be applied to traditional right-censored survival data as well as more complex time-to-event data subject to length-biased sampling. We establish the asymptotic properties and evaluate the small sample performance of the proposed estimators. We illustrate the proposed methods through applications to two examples. Journal: Journal of the American Statistical Association Pages: 1625-1635 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.946034 File-URL: http://hdl.handle.net/10.1080/01621459.2014.946034 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1625-1635 Template-Type: ReDIF-Article 1.0 Author-Name: Xiao Song Author-X-Name-First: Xiao Author-X-Name-Last: Song Author-Name: Ching-Yun Wang Author-X-Name-First: Ching-Yun Author-X-Name-Last: Wang Title: Proportional Hazards Model With Covariate Measurement Error and Instrumental Variables Abstract: In biomedical studies, covariates with measurement error may occur in survival data. Existing approaches mostly require certain replications on the error-contaminated covariates, which may not be available in the data. In this article, we develop a simple nonparametric correction approach for estimation of the regression parameters in the proportional hazards model using a subset of the sample where instrumental variables are observed. The instrumental variables are related to the covariates through a general nonparametric model, and no distributional assumptions are placed on the error and the underlying true covariates. We further propose a novel generalized method of moments nonparametric correction estimator to improve the efficiency over the simple correction approach. The efficiency gain can be substantial when the calibration subsample is small compared to the whole sample. The estimators are shown to be consistent and asymptotically normal. Performance of the estimators is evaluated via simulation studies and by an application to data from an HIV clinical trial. Estimation of the baseline hazard function is not addressed. Journal: Journal of the American Statistical Association Pages: 1636-1646 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.896805 File-URL: http://hdl.handle.net/10.1080/01621459.2014.896805 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1636-1646 Template-Type: ReDIF-Article 1.0 Author-Name: Jenný Brynjarsdóttir Author-X-Name-First: Jenný Author-X-Name-Last: Brynjarsdóttir Author-Name: L. Mark Berliner Author-X-Name-First: L. Mark Author-X-Name-Last: Berliner Title: Dimension-Reduced Modeling of Spatio-Temporal Processes Abstract: The field of spatial and spatio-temporal statistics is increasingly faced with the challenge of very large datasets. The classical approach to spatial and spatio-temporal modeling is very computationally demanding when datasets are large, which has led to interest in methods that use dimension-reduction techniques. In this article, we focus on modeling of two spatio-temporal processes where the primary goal is to predict one process from the other and where datasets for both processes are large. We outline a general dimension-reduced Bayesian hierarchical modeling approach where spatial structures of both processes are modeled in terms of a low number of basis vectors, hence reducing the spatial dimension of the problem. Temporal evolution of the processes and their dependence is then modeled through the coefficients of the basis vectors. We present a new method of obtaining data-dependent basis vectors, which is geared toward the goal of predicting one process from the other. We apply these methods to a statistical downscaling example, where surface temperatures on a coarse grid over Antarctica are downscaled onto a finer grid. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1647-1659 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.904232 File-URL: http://hdl.handle.net/10.1080/01621459.2014.904232 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1647-1659 Template-Type: ReDIF-Article 1.0 Author-Name: Brian Claggett Author-X-Name-First: Brian Author-X-Name-Last: Claggett Author-Name: Minge Xie Author-X-Name-First: Minge Author-X-Name-Last: Xie Author-Name: Lu Tian Author-X-Name-First: Lu Author-X-Name-Last: Tian Title: Meta-Analysis With Fixed, Unknown, Study-Specific Parameters Abstract: Meta-analysis is a valuable tool for combining information from independent studies. However, most common meta-analysis techniques rely on distributional assumptions that are difficult, if not impossible, to verify. For instance, in the commonly used fixed-effects and random-effects models, we take for granted that the underlying study-level parameters are either exactly the same across individual studies or realizations of a random sample from a population, often under a parametric distributional assumption. In this article, we present a new framework for summarizing information obtained from multiple studies and for making inferences that are not dependent on any distributional assumption for the study-level parameters. Specifically, we assume the study-level parameters are unknown, fixed parameters and draw inferences about, for example, the quantiles of this set of parameters using study-specific summary statistics. This type of problem is known to be quite challenging (see Hall and Miller). We use a novel resampling method via the confidence distributions of the study-level parameters to construct confidence intervals for the above quantiles.
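The dimension-reduced strategy in the Brynjarsdóttir and Berliner abstract above can be sketched in a few lines: represent each spatio-temporal field with a small number of basis vectors and couple the two processes through the basis coefficients. The sketch below uses plain SVD (EOF) bases and ordinary least squares for the coefficient coupling, not the paper's data-dependent basis construction or its Bayesian hierarchical model; all dimensions, the simulated fields, and the names are illustrative.

import numpy as np

# Represent each field in r basis vectors, then predict the fine-resolution
# process from the coarse one via a regression on the T x r coefficients.
rng = np.random.default_rng(2)
T, nx, ny, r = 50, 500, 300, 5            # time points, grid sizes, basis size

X = rng.normal(size=(T, nx))              # coarse-grid process, T x nx
Y = X[:, :ny] @ rng.normal(size=(ny, ny)) * 0.1 + rng.normal(size=(T, ny))

Ux = np.linalg.svd(X - X.mean(0), full_matrices=False)[2][:r].T   # nx x r basis
Uy = np.linalg.svd(Y - Y.mean(0), full_matrices=False)[2][:r].T   # ny x r basis

Ax = (X - X.mean(0)) @ Ux                 # coefficients of X, T x r
Ay = (Y - Y.mean(0)) @ Uy                 # coefficients of Y, T x r
B = np.linalg.lstsq(Ax, Ay, rcond=None)[0]   # couple the two coefficient sets

Y_hat = Y.mean(0) + (X - X.mean(0)) @ Ux @ B @ Uy.T   # predicted fine process

The computational point carries over from the abstract: all regression work happens in r dimensions rather than on the full spatial grids.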
We justify the validity of the interval estimation procedure asymptotically and compare the new procedure with the standard bootstrapping method. We also illustrate our proposal with the data from a recent meta-analysis of the effect of an antioxidant treatment on the prevention of contrast-induced nephropathy. Journal: Journal of the American Statistical Association Pages: 1660-1671 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.957288 File-URL: http://hdl.handle.net/10.1080/01621459.2014.957288 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1660-1671 Template-Type: ReDIF-Article 1.0 Author-Name: Hongyu Miao Author-X-Name-First: Hongyu Author-X-Name-Last: Miao Author-Name: Hulin Wu Author-X-Name-First: Hulin Author-X-Name-Last: Wu Author-Name: Hongqi Xue Author-X-Name-First: Hongqi Author-X-Name-Last: Xue Title: Generalized Ordinary Differential Equation Models Abstract: Existing estimation methods for ordinary differential equation (ODE) models are not applicable to discrete data. The generalized ODE (GODE) model is therefore proposed and investigated for the first time. We develop the likelihood-based parameter estimation and inference methods for GODE models. We propose robust computing algorithms and rigorously investigate the asymptotic properties of the proposed estimator by considering both measurement errors and numerical errors in solving ODEs. The simulation study and application of our methods to an influenza viral dynamics study suggest that the proposed methods have superior performance in terms of accuracy over the existing ODE model estimation approach and the extended smoothing-based (ESB) method. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1672-1682 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.957287 File-URL: http://hdl.handle.net/10.1080/01621459.2014.957287 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1672-1682 Template-Type: ReDIF-Article 1.0 Author-Name: Yunzhang Zhu Author-X-Name-First: Yunzhang Author-X-Name-Last: Zhu Author-Name: Xiaotong Shen Author-X-Name-First: Xiaotong Author-X-Name-Last: Shen Author-Name: Wei Pan Author-X-Name-First: Wei Author-X-Name-Last: Pan Title: Structural Pursuit Over Multiple Undirected Graphs Abstract: Gaussian graphical models are useful to analyze and visualize conditional dependence relationships between interacting units. Motivated by network analysis under different experimental conditions, such as gene networks for disparate cancer subtypes, we model structural changes over multiple networks with possible heterogeneities. In particular, we estimate multiple precision matrices describing dependencies among interacting units through maximum penalized likelihood. Of particular interest are homogeneous groups of similar entries across the matrices and zero entries within them, referred to as clustering and sparseness structures, respectively. A nonconvex method is proposed to seek a sparse representation for each matrix and identify clusters of the entries across the matrices.
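For contrast with the GODE proposal in the Miao, Wu, and Xue abstract above, here is a minimal sketch of the classical ODE estimation approach it generalizes: trajectory matching by nonlinear least squares on continuous noisy observations (GODE itself targets discrete data through a likelihood). The toy decay dynamics and all names are illustrative.

import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

def viral_decay(t, y, c):
    # Toy one-state dynamics: dy/dt = -c * y.
    return [-c * y[0]]

t_obs = np.linspace(0.0, 5.0, 20)
rng = np.random.default_rng(3)
truth = solve_ivp(viral_decay, (0, 5), [10.0], t_eval=t_obs, args=(0.8,)).y[0]
y_obs = truth + rng.normal(scale=0.2, size=t_obs.size)   # noisy observations

def sse(theta):
    # Sum of squared errors between the ODE solution and the data.
    c, y0 = theta
    sol = solve_ivp(viral_decay, (0, 5), [y0], t_eval=t_obs, args=(c,))
    return np.sum((sol.y[0] - y_obs) ** 2)

fit = minimize(sse, x0=[0.5, 8.0], method="Nelder-Mead")
print(fit.x)                              # estimated (c, y0)

As the abstract notes, the numerical error of the ODE solver enters the estimate alongside the measurement error, which is one of the issues the GODE asymptotics address.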
Computationally, we develop an efficient method on the basis of difference convex programming, the augmented Lagrangian method, and the blockwise coordinate descent method, which is scalable to hundreds of graphs with thousands of nodes through a simple necessary and sufficient partition rule that divides nodes into smaller disjoint subproblems, excluding zero-coefficient nodes, for arbitrary graphs under convex relaxation. Theoretically, a finite-sample error bound is derived for the proposed method to reconstruct the clustering and sparseness structures. This leads to consistent reconstruction of these two structures simultaneously, permitting the number of unknown parameters to be exponential in the sample size, and yielding the optimal performance of the oracle estimator as if the true structures were given a priori. Simulation studies suggest that the method enjoys the benefit of pursuing these two disparate kinds of structures, and compares favorably against its convex counterpart in the accuracy of structure pursuit and parameter estimation. Journal: Journal of the American Statistical Association Pages: 1683-1696 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.921182 File-URL: http://hdl.handle.net/10.1080/01621459.2014.921182 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1683-1696 Template-Type: ReDIF-Article 1.0 Author-Name: Markus Frölich Author-X-Name-First: Markus Author-X-Name-Last: Frölich Author-Name: Martin Huber Author-X-Name-First: Martin Author-X-Name-Last: Huber Title: Treatment Evaluation With Multiple Outcome Periods Under Endogeneity and Attrition Abstract: This article develops a nonparametric methodology for treatment evaluation with multiple outcome periods under treatment endogeneity and missing outcomes. We use instrumental variables, pretreatment characteristics, and short-term (or intermediate) outcomes to identify the average treatment effect on the outcomes of compliers (the subpopulation whose treatment reacts to the instrument) in multiple periods based on inverse probability weighting. Treatment selection and attrition may depend on both observed characteristics and the unobservable compliance type, which is possibly related to unobserved factors. We also provide a simulation study and apply our methods to the evaluation of a policy intervention targeting college achievement, where we find that controlling for attrition considerably affects the effect estimates. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1697-1711 Issue: 508 Volume: 109 Year: 2014 Month: 12 X-DOI: 10.1080/01621459.2014.896804 File-URL: http://hdl.handle.net/10.1080/01621459.2014.896804 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1697-1711 Template-Type: ReDIF-Article 1.0 Author-Name: Nathaniel Schenker Author-X-Name-First: Nathaniel Author-X-Name-Last: Schenker Title: Why Your Involvement Matters Abstract: The International Year of Statistics, 2013, focused on outreach in a wonderful way. As we celebrate the ASA's 175th anniversary in 2014, it is worthwhile to look inward as well and think about how to keep our association and profession strong, so that our successors will be able to celebrate the 275th anniversary.
The ASA, with its long history, its fine staff and organization, and its financial resource base, is well positioned to serve the profession, and indeed society, and it is very successful at doing so. But the real measure of the health of our association is the size and level of engagement of its membership, whose participation is a major source of the ASA's strength. So, what is it that compels people to be members? One might argue that it is the tangible benefits that we receive in exchange for our dues--magazine and journal subscriptions, discounted meeting registrations, and so on. Although such benefits are attractive, I believe they are not the primary reasons people are ASA members. What compels people is the value they find through involvement in the association. Unlike benefits, which are objective, value is subjective, varying over time and varying from member to member or group to group. And unlike benefits, which can be listed as bullet points, value is best borne out in personal experiences. In this address, I will use experiences that ASA members have shared with me, along with experiences of my own, to paint a picture of the deep value that involvement in the ASA has provided. I also will challenge you to continue to find the extraordinary value available through involvement in our association. Journal: Journal of the American Statistical Association Pages: 1-5 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2015.1021616 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1021616 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:1-5 Template-Type: ReDIF-Article 1.0 Author-Name: Zhengyi Zhou Author-X-Name-First: Zhengyi Author-X-Name-Last: Zhou Author-Name: David S. Matteson Author-X-Name-First: David S. Author-X-Name-Last: Matteson Author-Name: Dawn B. Woodard Author-X-Name-First: Dawn B. Author-X-Name-Last: Woodard Author-Name: Shane G. Henderson Author-X-Name-First: Shane G. Author-X-Name-Last: Henderson Author-Name: Athanasios C. Micheas Author-X-Name-First: Athanasios C. Author-X-Name-Last: Micheas Title: A Spatio-Temporal Point Process Model for Ambulance Demand Abstract: Ambulance demand estimation at fine time and location scales is critical for fleet management and dynamic deployment. We are motivated by the problem of estimating the spatial distribution of ambulance demand in Toronto, Canada, as it changes over discrete 2-hour intervals. This large-scale dataset is sparse at the desired temporal resolutions and exhibits location-specific serial dependence as well as daily and weekly seasonality. We address these challenges by introducing a novel characterization of time-varying Gaussian mixture models. We fix the mixture component distributions across all time periods to overcome data sparsity and accurately describe Toronto's spatial structure, while representing the complex spatio-temporal dynamics through time-varying mixture weights. We constrain the mixture weights to capture weekly seasonality, and apply a conditionally autoregressive prior on the mixture weights of each component to represent location-specific short-term serial dependence and daily seasonality. While estimation may be performed using a fixed number of mixture components, we also extend the model to estimate the number of components using birth-and-death Markov chain Monte Carlo.
The proposed model is shown to give higher statistical predictive accuracy and to reduce the error in predicting emergency medical service operational performance by as much as two-thirds compared to a typical industry practice. Journal: Journal of the American Statistical Association Pages: 6-15 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.941466 File-URL: http://hdl.handle.net/10.1080/01621459.2014.941466 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:6-15 Template-Type: ReDIF-Article 1.0 Author-Name: Xu Tang Author-X-Name-First: Xu Author-X-Name-Last: Tang Author-Name: Fah F. Gan Author-X-Name-First: Fah F. Author-X-Name-Last: Gan Author-Name: Lingyun Zhang Author-X-Name-First: Lingyun Author-X-Name-Last: Zhang Title: Risk-Adjusted Cumulative Sum Charting Procedure Based on Multiresponses Abstract: The cumulative sum charting procedure is traditionally used in the manufacturing industry for monitoring the quality of products. Recently, it has been extended to monitoring surgical outcomes. Unlike a manufacturing process where the raw material is usually reasonably homogeneous, patients' risks of surgical failure are usually different. It has been proposed in the literature that the binary outcomes from a surgical procedure be adjusted using the preoperative risk based on a likelihood-ratio scoring method. Such a crude classification of surgical outcome is naive. It is unreasonable to regard a patient who has a full recovery as having the same quality of outcome as another patient who survived but remained bedridden for life. For a patient who survives an operation, there can be many different grades of recovery. Thus, it makes sense to consider a risk-adjusted cumulative sum charting procedure based on more than two outcomes to better monitor surgical performance. In this article, we develop such a chart and study its performance. Journal: Journal of the American Statistical Association Pages: 16-26 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.960965 File-URL: http://hdl.handle.net/10.1080/01621459.2014.960965 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:16-26 Template-Type: ReDIF-Article 1.0 Author-Name: Alexander M. Franks Author-X-Name-First: Alexander M. Author-X-Name-Last: Franks Author-Name: Gábor Csárdi Author-X-Name-First: Gábor Author-X-Name-Last: Csárdi Author-Name: D. Allan Drummond Author-X-Name-First: D. Allan Author-X-Name-Last: Drummond Author-Name: Edoardo M. Airoldi Author-X-Name-First: Edoardo M. Author-X-Name-Last: Airoldi Title: Estimating a Structured Covariance Matrix From Multilab Measurements in High-Throughput Biology Abstract: We consider the problem of quantifying the degree of coordination between transcription and translation, in yeast. Several studies have reported a surprising lack of coordination over the years, in organisms as different as yeast and humans, using diverse technologies. However, a close look at this literature suggests that the lack of reported correlation may not reflect the biology of regulation. These reports do not control for between-study biases and structure in the measurement errors, ignore key aspects of how the data connect to the estimand, and systematically underestimate the correlation as a consequence.
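The binary likelihood-ratio scoring that the Tang, Gan, and Zhang abstract above takes as its starting point can be sketched directly. The chart below monitors 0/1 surgical failures with patient-specific preoperative risks, testing an in-control odds ratio of 1 against a doubled odds ratio; it is a standard binary risk-adjusted CUSUM sketch, not the article's multiresponse extension, and the control limit, odds ratio, and simulated patients are illustrative.

import numpy as np

def risk_adjusted_cusum(outcomes, risks, odds_ratio=2.0, limit=4.5):
    # Binary risk-adjusted CUSUM with likelihood-ratio scores.
    # outcomes: 0/1 surgical failures; risks: preoperative failure
    # probabilities; tests odds ratio 1 (in control) vs `odds_ratio`.
    s, path = 0.0, []
    for y, p in zip(outcomes, risks):
        denom = 1.0 - p + odds_ratio * p          # normalizer of the shifted odds
        w = np.log(odds_ratio / denom) if y == 1 else np.log(1.0 / denom)
        s = max(0.0, s + w)                       # CUSUM recursion, floored at 0
        path.append(s)
        if s > limit:
            break                                 # signal: performance shift
    return np.array(path)

rng = np.random.default_rng(4)
risks = rng.uniform(0.05, 0.4, size=200)          # heterogeneous patient risks
outcomes = rng.binomial(1, risks)                 # in-control surgical stream
print(risk_adjusted_cusum(outcomes, risks)[-5:])

The article's point is that the two-valued outcome in this scheme is too coarse; its chart replaces the 0/1 score with likelihood-ratio scores over several graded recovery outcomes.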
Here, we design a careful meta-analysis of 27 yeast datasets, supported by a multilevel model, full uncertainty quantification, a suite of sensitivity analyses, and novel theory, to produce a more accurate estimate of the correlation between mRNA and protein levels--a proxy for coordination. From a statistical perspective, this problem motivates new theory on the impact of noise, model misspecifications, and nonignorable missing data on estimates of the correlation between high-dimensional responses. We find that the correlation between mRNA and protein levels is quite high under the studied conditions, in yeast, suggesting that post-transcriptional regulation plays a less prominent role than previously thought. Journal: Journal of the American Statistical Association Pages: 27-44 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.964404 File-URL: http://hdl.handle.net/10.1080/01621459.2014.964404 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:27-44 Template-Type: ReDIF-Article 1.0 Author-Name: Antonio R. Linero Author-X-Name-First: Antonio R. Author-X-Name-Last: Linero Author-Name: Michael J. Daniels Author-X-Name-First: Michael J. Author-X-Name-Last: Daniels Title: A Flexible Bayesian Approach to Monotone Missing Data in Longitudinal Studies With Nonignorable Missingness With Application to an Acute Schizophrenia Clinical Trial Abstract: We develop a Bayesian nonparametric model for a longitudinal response in the presence of nonignorable missing data. Our general approach is to first specify a working model that flexibly models the missingness and full outcome processes jointly. We specify a Dirichlet process mixture of missing at random (MAR) models as a prior on the joint distribution of the working model. This aspect of the model governs the fit of the observed data by modeling the observed data distribution as the marginalization over the missing data in the working model. We then separately specify the conditional distribution of the missing data given the observed data and dropout. This approach allows us to identify the distribution of the missing data using identifying restrictions as a starting point. We propose a framework for introducing sensitivity parameters, allowing us to vary the untestable assumptions about the missing data mechanism smoothly. Informative priors on the space of missing data assumptions can be specified to combine inferences under many different assumptions into a final inference and accurately characterize uncertainty. These methods are motivated by, and applied to, data from a clinical trial assessing the efficacy of a new treatment for acute schizophrenia. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 45-55 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.969424 File-URL: http://hdl.handle.net/10.1080/01621459.2014.969424 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:45-55 Template-Type: ReDIF-Article 1.0 Author-Name: Giwhyun Lee Author-X-Name-First: Giwhyun Author-X-Name-Last: Lee Author-Name: Yu Ding Author-X-Name-First: Yu Author-X-Name-Last: Ding Author-Name: Marc G. Genton Author-X-Name-First: Marc G. 
Author-X-Name-Last: Genton Author-Name: Le Xie Author-X-Name-First: Le Author-X-Name-Last: Xie Title: Power Curve Estimation With Multivariate Environmental Factors for Inland and Offshore Wind Farms Abstract: In the wind industry, a power curve refers to the functional relationship between the power output generated by a wind turbine and the wind speed at the time of power generation. Power curves are used in practice for a number of important tasks including predicting wind power production and assessing a turbine's energy production efficiency. Nevertheless, actual wind power data indicate that the power output is affected by more than just wind speed. Several other environmental factors, such as wind direction, air density, humidity, turbulence intensity, and wind shear, have a potential impact. Yet, in industry practice, as well as in the literature, current power curve models primarily consider wind speed and, sometimes, wind speed and direction. As a new power curve model, we propose an additive multivariate kernel method that can include the aforementioned environmental factors. Our model provides, conditional on a given environmental condition, both the point estimation and density estimation of power output. It is able to capture the nonlinear relationships between environmental factors and the wind power output, as well as the high-order interaction effects among some of the environmental factors. Using operational data associated with four turbines in an inland wind farm and two turbines in an offshore wind farm, we demonstrate the improvement achieved by our kernel method. Journal: Journal of the American Statistical Association Pages: 56-67 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.977385 File-URL: http://hdl.handle.net/10.1080/01621459.2014.977385 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:56-67 Template-Type: ReDIF-Article 1.0 Author-Name: Curtis B. Storlie Author-X-Name-First: Curtis B. Author-X-Name-Last: Storlie Author-Name: William A. Lane Author-X-Name-First: William A. Author-X-Name-Last: Lane Author-Name: Emily M. Ryan Author-X-Name-First: Emily M. Author-X-Name-Last: Ryan Author-Name: James R. Gattiker Author-X-Name-First: James R. Author-X-Name-Last: Gattiker Author-Name: David M. Higdon Author-X-Name-First: David M. Author-X-Name-Last: Higdon Title: Calibration of Computational Models With Categorical Parameters and Correlated Outputs via Bayesian Smoothing Spline ANOVA Abstract: It has become commonplace to use complex computer models to predict outcomes in regions where data do not exist. Typically these models need to be calibrated and validated using some experimental data, which often consist of multiple correlated outcomes. In addition, some of the model parameters may be categorical in nature, such as a pointer variable to alternate models (or submodels) for some of the physics of the system. Here, we present a general approach for calibration in such situations where an emulator of the computationally demanding models and a discrepancy term from the model to reality are represented within a Bayesian smoothing spline (BSS) ANOVA framework. The BSS-ANOVA framework has several advantages over the traditional Gaussian process, including ease of handling categorical inputs and correlated outputs, and improved computational efficiency.
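A simplified version of the power curve estimator in the Lee, Ding, Genton, and Xie abstract above: a multivariate kernel smoother that conditions the expected power output on several environmental inputs at once. The sketch uses a plain Nadaraya-Watson product-kernel regression rather than the article's additive multivariate kernel formulation, and the simulated turbine data, bandwidths, and names are all illustrative.

import numpy as np

def kernel_power_curve(X, y, x_new, h):
    # Nadaraya-Watson estimate of expected power output at condition x_new.
    # X: (n, d) environmental inputs (e.g., wind speed, air density);
    # h: (d,) per-input bandwidths.
    u = (X - x_new) / h                           # scaled distances, (n, d)
    w = np.exp(-0.5 * np.sum(u**2, axis=1))       # product Gaussian kernel
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(5)
speed = rng.uniform(3, 25, 500)                   # wind speed, m/s
density = rng.normal(1.225, 0.05, 500)            # air density, kg/m^3
X = np.column_stack([speed, density])
power = np.clip(0.5 * density * speed**3 / 10, 0, 1500) + rng.normal(0, 20, 500)

est = kernel_power_curve(X, power, x_new=np.array([12.0, 1.22]),
                         h=np.array([1.0, 0.02]))
print(est)

Conditioning on more inputs than wind speed alone is exactly what lets such a smoother capture the additional environmental effects the abstract lists; the article's additive structure is a device to keep that multivariate smoothing tractable.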
Finally, this framework is then applied to the problem that motivated its design: the calibration of a computational fluid dynamics (CFD) model of a bubbling fluidized bed, which is used as an absorber in a CO2 capture system. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 68-82 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.979993 File-URL: http://hdl.handle.net/10.1080/01621459.2014.979993 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:68-82 Template-Type: ReDIF-Article 1.0 Author-Name: Kentaro Fukumoto Author-X-Name-First: Kentaro Author-X-Name-Last: Fukumoto Title: What Happens Depends on When It Happens: Copula-Based Ordered Event History Analysis of Civil War Duration and Outcome Abstract: Scholars are interested in not just what event happens but also when the event happens. If there is dependence among events or dependence between time and events, however, the currently common methods (e.g., competing risks approaches) produce biased estimates. To deal with these problems, this article proposes a new method of copula-based ordered event history analysis (COEHA). A merit of working with copulas is that, whatever marginal distributions the time and event variables follow (including the Cox model), researchers can derive whatever joint distribution exists between the two. Application of the COEHA model to a dataset from civil wars supports two controversial hypotheses. First, as wars become longer, rebel victory becomes more likely but settlement does not (there is dependence between time and events at both tails). Second, stronger rebels make wars shorter but do not necessarily tend to win, as experts predict but fail to establish (rebels' strength shortens time but has no effect on which events occur). Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 83-92 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.979994 File-URL: http://hdl.handle.net/10.1080/01621459.2014.979994 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:83-92 Template-Type: ReDIF-Article 1.0 Author-Name: Tingting Zhang Author-X-Name-First: Tingting Author-X-Name-Last: Zhang Author-Name: Jingwei Wu Author-X-Name-First: Jingwei Author-X-Name-Last: Wu Author-Name: Fan Li Author-X-Name-First: Fan Author-X-Name-Last: Li Author-Name: Brian Caffo Author-X-Name-First: Brian Author-X-Name-Last: Caffo Author-Name: Dana Boatman-Reich Author-X-Name-First: Dana Author-X-Name-Last: Boatman-Reich Title: A Dynamic Directional Model for Effective Brain Connectivity Using Electrocorticographic (ECoG) Time Series Abstract: We introduce a dynamic directional model (DDM) for studying brain effective connectivity based on intracranial electrocorticographic (ECoG) time series. The DDM consists of two parts: a set of differential equations describing neuronal activity of brain components (state equations), and observation equations linking the underlying neuronal states to observed data. When applied to functional MRI or EEG data, DDMs usually have complex formulations and thus can accommodate only a few regions, due to limitations in spatial resolution and/or temporal resolution of these imaging modalities. In contrast, we formulate our model in the context of ECoG data.
The combined high temporal and spatial resolution of ECoG data results in a much simpler DDM, allowing investigation of complex connections between many regions. To identify functionally segregated subnetworks, a form of biologically economical brain networks, we propose the Potts model for the DDM parameters. The neuronal states of brain components are represented by cubic spline bases and the parameters are estimated by minimizing a log-likelihood criterion that combines the state and observation equations. The Potts model is converted to the Potts penalty in the penalized regression approach to achieve sparsity in parameter estimation, for which a fast iterative algorithm is developed. The methods are applied to an auditory ECoG dataset. Journal: Journal of the American Statistical Association Pages: 93-106 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.988213 File-URL: http://hdl.handle.net/10.1080/01621459.2014.988213 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:93-106 Template-Type: ReDIF-Article 1.0 Author-Name: Weibing Huang Author-X-Name-First: Weibing Author-X-Name-Last: Huang Author-Name: Charles-Albert Lehalle Author-X-Name-First: Charles-Albert Author-X-Name-Last: Lehalle Author-Name: Mathieu Rosenbaum Author-X-Name-First: Mathieu Author-X-Name-Last: Rosenbaum Title: Simulating and Analyzing Order Book Data: The Queue-Reactive Model Abstract: Through the analysis of a dataset of ultra high frequency order book updates, we introduce a model that accommodates the empirical properties of the full order book together with the stylized facts of lower frequency financial data. To do so, we split the time interval of interest into periods in which a well-chosen reference price, typically the midprice, remains constant. Within these periods, we view the limit order book as a Markov queuing system. Indeed, we assume that the intensities of the order flows only depend on the current state of the order book. We establish the limiting behavior of this model and estimate its parameters from market data. Then, to design a relevant model for the whole period of interest, we use a stochastic mechanism that allows us to switch from one period of constant reference price to another. Beyond enabling us to reproduce the behavior of market data accurately, we show that our framework can be very useful for practitioners, notably as a market simulator or as a tool for the transaction cost analysis of complex trading algorithms. Journal: Journal of the American Statistical Association Pages: 107-122 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.982278 File-URL: http://hdl.handle.net/10.1080/01621459.2014.982278 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:107-122 Template-Type: ReDIF-Article 1.0 Author-Name: Matthew J. Heaton Author-X-Name-First: Matthew J. Author-X-Name-Last: Heaton Author-Name: Stephan R. Sain Author-X-Name-First: Stephan R. Author-X-Name-Last: Sain Author-Name: Andrew J. Monaghan Author-X-Name-First: Andrew J. Author-X-Name-Last: Monaghan Author-Name: Olga V. Wilhelmi Author-X-Name-First: Olga V. Author-X-Name-Last: Wilhelmi Author-Name: Mary H. Hayden Author-X-Name-First: Mary H.
Author-X-Name-Last: Hayden Title: An Analysis of an Incomplete Marked Point Pattern of Heat-Related 911 Calls Abstract: We analyze an incomplete marked point pattern of heat-related 911 calls between 2006 and 2010 in Houston, TX, to primarily investigate conditions that are associated with increased vulnerability to heat-related morbidity and, secondarily, build a statistical model that can be used as a public health tool to predict the volume of 911 calls given a time frame and heat exposure. We model the calls as arising from a nonhomogeneous Cox process with unknown intensity measure. Using the kernel convolution construction of a Gaussian process, we model the intensity surface with a low-dimensional representation that properly adheres to circular domain constraints. We account for the incomplete observations by marginalizing the joint intensity measure over the domain of the missing marks and also demonstrate model-based imputation. We find that spatial regions of high risk for heat-related 911 calls are temporally dynamic with the highest risk occurring in urban areas during the day. We also find that elderly populations have a higher probability of calling 911 with heat-related issues than younger populations. Finally, the age of individuals and the hour of the day with the highest intensity of heat-related 911 calls vary by race/ethnicity. Supplementary materials are included with this article. Journal: Journal of the American Statistical Association Pages: 123-135 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.983229 File-URL: http://hdl.handle.net/10.1080/01621459.2014.983229 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:123-135 Template-Type: ReDIF-Article 1.0 Author-Name: J. L. Scealy Author-X-Name-First: J. L. Author-X-Name-Last: Scealy Author-Name: Patrice de Caritat Author-X-Name-First: Patrice Author-X-Name-Last: de Caritat Author-Name: Eric C. Grunsky Author-X-Name-First: Eric C. Author-X-Name-Last: Grunsky Author-Name: Michail T. Tsagris Author-X-Name-First: Michail T. Author-X-Name-Last: Tsagris Author-Name: A. H. Welsh Author-X-Name-First: A. H. Author-X-Name-Last: Welsh Title: Robust Principal Component Analysis for Power Transformed Compositional Data Abstract: Geochemical surveys collect sediment or rock samples, measure the concentration of chemical elements, and report these typically either in weight percent or in parts per million (ppm). There are usually a large number of elements measured and the distributions are often skewed, containing many potential outliers. We present a new robust principal component analysis (PCA) method for geochemical survey data that involves first transforming the compositional data onto a manifold using a relative power transformation. A flexible set of moment assumptions is made that takes the special geometry of the manifold into account. The Kent distribution moment structure arises as a special case when the chosen manifold is the hypersphere. We derive simple moment and robust estimators (RO) of the parameters which are also applicable in high-dimensional settings. The resulting PCA based on these estimators is done in the tangent space and is related to the power transformation method used in correspondence analysis. To illustrate, we analyze major oxide data from the National Geochemical Survey of Australia.
When compared with the traditional approach in the literature based on the centered log-ratio transformation, the new PCA method is shown to be more successful at dimension reduction and to give interpretable results. Journal: Journal of the American Statistical Association Pages: 136-148 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.990563 File-URL: http://hdl.handle.net/10.1080/01621459.2014.990563 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:136-148 Template-Type: ReDIF-Article 1.0 Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Xin Tong Author-X-Name-First: Xin Author-X-Name-Last: Tong Author-Name: Yao Zeng Author-X-Name-First: Yao Author-X-Name-Last: Zeng Title: Multi-Agent Inference in Social Networks: A Finite Population Learning Approach Abstract: When people in a society want to make inference about some parameter, each person may want to use data collected by other people. Information (data) exchange in social networks is usually costly, so to make reliable statistical decisions, people need to weigh the benefits and costs of information acquisition. Conflicts of interest and coordination problems will arise in the process. Classical statistics does not consider people's incentives and interactions in the data-collection process. To address this imperfection, this work explores multi-agent Bayesian inference problems with a game theoretic social network model. Motivated by our interest in aggregate inference at the societal level, we propose a new concept, finite population learning, to address whether, with high probability, a large fraction of people in a given finite population network can make "good" inference. Serving as a foundation, this concept enables us to study the long-run trend of aggregate inference quality as population grows. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 149-158 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.893885 File-URL: http://hdl.handle.net/10.1080/01621459.2014.893885 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:149-158 Template-Type: ReDIF-Article 1.0 Author-Name: Christine Peterson Author-X-Name-First: Christine Author-X-Name-Last: Peterson Author-Name: Francesco C. Stingo Author-X-Name-First: Francesco C. Author-X-Name-Last: Stingo Author-Name: Marina Vannucci Author-X-Name-First: Marina Author-X-Name-Last: Vannucci Title: Bayesian Inference of Multiple Gaussian Graphical Models Abstract: In this article, we propose a Bayesian approach to inference on multiple Gaussian graphical models. Specifically, we address the problem of inferring multiple undirected networks in situations where some of the networks may be unrelated, while others share common features. We link the estimation of the graph structures via a Markov random field (MRF) prior, which encourages common edges. We learn which sample groups have a shared graph structure by placing a spike-and-slab prior on the parameters that measure network relatedness. This approach allows us to share information between sample groups, when appropriate, as well as to obtain a measure of relative network similarity across groups.
Our modeling framework incorporates relevant prior knowledge through an edge-specific informative prior and can encourage similarity to an established network. Through simulations, we demonstrate the utility of our method in summarizing relative network similarity and compare its performance against related methods. We find improved accuracy of network estimation, particularly when the sample sizes within each subgroup are moderate. We also illustrate the application of our model to infer protein networks for various cancer subtypes and under different experimental conditions. Journal: Journal of the American Statistical Association Pages: 159-174 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.896806 File-URL: http://hdl.handle.net/10.1080/01621459.2014.896806 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:159-174 Template-Type: ReDIF-Article 1.0 Author-Name: Zheng Tracy Ke Author-X-Name-First: Zheng Tracy Author-X-Name-Last: Ke Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Yichao Wu Author-X-Name-First: Yichao Author-X-Name-Last: Wu Title: Homogeneity Pursuit Abstract: This article explores the homogeneity of coefficients in high-dimensional regression, which extends the sparsity concept and is more general and suitable for many applications. Homogeneity arises when regression coefficients corresponding to neighboring geographical regions or a similar cluster of covariates are expected to be approximately the same. Sparsity corresponds to a special case of homogeneity with a large cluster at the known atom zero. In this article, we propose a new method called clustering algorithm in regression via data-driven segmentation (CARDS) to explore homogeneity. New mathematical results are provided on the gain that can be achieved by exploring homogeneity. Statistical properties of two versions of CARDS are analyzed. In particular, the asymptotic normality of our proposed CARDS estimator is established, which reveals better estimation accuracy for homogeneous parameters than is achieved without homogeneity exploration. When our methods are combined with sparsity exploration, further efficiency can be achieved beyond the exploration of sparsity alone. This provides additional insights into the power of exploring low-dimensional structures in high-dimensional regression: homogeneity and sparsity. Our results also shed light on the properties of the fused Lasso. The newly developed method is further illustrated by simulation studies and applications to real data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 175-194 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.892882 File-URL: http://hdl.handle.net/10.1080/01621459.2014.892882 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:175-194 Template-Type: ReDIF-Article 1.0 Author-Name: D. L. Borchers Author-X-Name-First: D. L. Author-X-Name-Last: Borchers Author-Name: B. C. Stevenson Author-X-Name-First: B. C. Author-X-Name-Last: Stevenson Author-Name: D. Kidney Author-X-Name-First: D. Author-X-Name-Last: Kidney Author-Name: L. Thomas Author-X-Name-First: L. Author-X-Name-Last: Thomas Author-Name: T. A. Marques Author-X-Name-First: T. A.
Author-X-Name-Last: Marques Title: A Unifying Model for Capture-Recapture and Distance Sampling Surveys of Wildlife Populations Abstract: A fundamental problem in wildlife ecology and management is estimation of population size or density. The two dominant methods in this area are capture-recapture (CR) and distance sampling (DS), each with its own largely separate literature. We develop a class of models that synthesizes them. It accommodates a spectrum of models ranging from nonspatial CR models (with no information on animal locations) through to DS and mark-recapture distance sampling (MRDS) models, in which animal locations are observed without error. Between these lie spatially explicit capture-recapture (SECR) models that include only capture locations, and a variety of models with less location data than are typical of DS surveys but more than are normally used on SECR surveys. In addition to unifying CR and DS models, the class provides a means of improving inference from SECR models by adding supplementary location data, and a means of incorporating measurement error into DS and MRDS models. We illustrate their utility by comparing inference on acoustic surveys of gibbons and frogs using only capture locations, using estimated angles (gibbons) and combinations of received signal strength and time-of-arrival data (frogs), and on a visual MRDS survey of whales, comparing estimates with exact and estimated distances. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 195-204 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.893884 File-URL: http://hdl.handle.net/10.1080/01621459.2014.893884 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:195-204 Template-Type: ReDIF-Article 1.0 Author-Name: Paul R. Rosenbaum Author-X-Name-First: Paul R. Author-X-Name-Last: Rosenbaum Title: Bahadur Efficiency of Sensitivity Analyses in Observational Studies Abstract: An observational study draws inferences about treatment effects when treatments are not randomly assigned, as they would be in a randomized experiment. The naive analysis of an observational study assumes that adjustments for measured covariates suffice to remove bias from nonrandom treatment assignment. A sensitivity analysis in an observational study determines the magnitude of bias from nonrandom treatment assignment that would need to be present to alter the qualitative conclusions of the naive analysis, say leading to the acceptance of a null hypothesis rejected in the naive analysis. Observational studies vary greatly in their sensitivity to unmeasured biases, but a poor choice of test statistic can lead to an exaggerated report of sensitivity to bias. The Bahadur efficiency of a sensitivity analysis is introduced, calculated, and connected to established concepts, such as the power of a sensitivity analysis and the design sensitivity. The Bahadur slope equals zero when the sensitivity parameter equals the design sensitivity, but the Bahadur slope permits more refined distinctions. Specifically, the Bahadur relative efficiency can also compare the relative performance of two test statistics at a value of the sensitivity parameter below the minimum of their design sensitivities. Adaptive procedures that combine several tests can achieve the best design sensitivity and the best Bahadur slope of their component tests. 
Ultimately, in sufficiently large sample sizes, design sensitivity is more important than efficiency for the power of a sensitivity analysis, and the exponential rate at which design sensitivity overtakes efficiency is characterized. Journal: Journal of the American Statistical Association Pages: 205-217 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.960968 File-URL: http://hdl.handle.net/10.1080/01621459.2014.960968 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:205-217 Template-Type: ReDIF-Article 1.0 Author-Name: Marc Hallin Author-X-Name-First: Marc Author-X-Name-Last: Hallin Author-Name: Chintan Mehta Author-X-Name-First: Chintan Author-X-Name-Last: Mehta Title: R-Estimation for Asymmetric Independent Component Analysis Abstract: Independent component analysis (ICA) recently has attracted much attention in the statistical literature as an appealing alternative to elliptical models. Whereas k-dimensional elliptical densities depend on a single unspecified radial density, however, k-dimensional independent component distributions involve k unspecified component densities. In practice, for given sample size n and dimension k, this makes the statistical analysis much harder. We focus here on the estimation, from an independent sample, of the mixing/demixing matrix of the model. Traditional methods (FOBI, Kernel-ICA, FastICA) mainly originate from the engineering literature. Their consistency requires moment conditions, they lack robustness, and they do not achieve any type of asymptotic efficiency. When based on robust scatter matrices, the two-scatter methods developed by Oja, Sirkia, and Eriksson in 2006 and Nordhausen, Oja, and Ollila in 2008 enjoy better robustness features, but their optimality properties remain unclear. The "classical semiparametric" approach by Chen and Bickel in 2006, quite on the contrary, achieves semiparametric efficiency, but requires the estimation of the densities of the k unobserved independent components. As a reaction, an efficient (signed-)rank-based approach was proposed by Ilmonen and Paindaveine in 2011 for the case of symmetric component densities. The performance of their estimators is quite good, but they unfortunately fail to be root-n consistent as soon as one of the component densities violates the symmetry assumption. In this article, using ranks rather than signed ranks, we extend their approach to the asymmetric case and propose a one-step R-estimator for ICA mixing matrices. The finite-sample performances of those estimators are investigated and compared to those of existing methods under moderately large sample sizes. Particularly good performance is obtained from a version involving data-driven scores that take into account the skewness and kurtosis of the residuals. Finally, we show, by an empirical exercise, that our methods also may provide excellent results in a context such as image analysis, where the basic assumptions of ICA are quite unlikely to hold. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 218-232 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.909316 File-URL: http://hdl.handle.net/10.1080/01621459.2014.909316 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:218-232 Template-Type: ReDIF-Article 1.0 Author-Name: Linglong Kong Author-X-Name-First: Linglong Author-X-Name-Last: Kong Author-Name: Douglas P. Wiens Author-X-Name-First: Douglas P. Author-X-Name-Last: Wiens Title: Model-Robust Designs for Quantile Regression Abstract: We give methods for the construction of designs for regression models when the purpose of the investigation is the estimation of the conditional quantile function, and the estimation method is quantile regression. The designs are robust against misspecified response functions and against unanticipated heteroscedasticity. The methods are illustrated by example, and in a case study in which they are applied to growth charts. Journal: Journal of the American Statistical Association Pages: 233-245 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.969427 File-URL: http://hdl.handle.net/10.1080/01621459.2014.969427 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:233-245 Template-Type: ReDIF-Article 1.0 Author-Name: Guodong Li Author-X-Name-First: Guodong Author-X-Name-Last: Li Author-Name: Yang Li Author-X-Name-First: Yang Author-X-Name-Last: Li Author-Name: Chih-Ling Tsai Author-X-Name-First: Chih-Ling Author-X-Name-Last: Tsai Title: Quantile Correlations and Quantile Autoregressive Modeling Abstract: In this article, we propose two important measures, quantile correlation (QCOR) and quantile partial correlation (QPCOR). We then apply them to quantile autoregressive (QAR) models, and introduce two valuable quantities, the quantile autocorrelation function (QACF) and the quantile partial autocorrelation function (QPACF). This allows us to extend the Box-Jenkins three-stage procedure (model identification, model parameter estimation, and model diagnostic checking) from classical autoregressive models to quantile autoregressive models. Specifically, the QPACF of an observed time series can be employed to identify the autoregressive order, while the QACF of residuals obtained from the fitted model can be used to assess the model adequacy. We not only demonstrate the asymptotic properties of QCOR and QPCOR, but also show the large sample results of QACF, QPACF, and the quantile version of the Box-Pierce test. Moreover, we obtain the bootstrap approximations to the distributions of parameter estimators and proposed measures. Simulation studies indicate that the proposed methods perform well in finite samples, and an empirical example is presented to illustrate their usefulness. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 246-261 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.892007 File-URL: http://hdl.handle.net/10.1080/01621459.2014.892007 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:246-261 Template-Type: ReDIF-Article 1.0 Author-Name: Francis K. C. Hui Author-X-Name-First: Francis K. C. Author-X-Name-Last: Hui Author-Name: David I. Warton Author-X-Name-First: David I. Author-X-Name-Last: Warton Author-Name: Scott D. Foster Author-X-Name-First: Scott D. Author-X-Name-Last: Foster Title: Tuning Parameter Selection for the Adaptive Lasso Using ERIC Abstract: The adaptive Lasso is a commonly applied penalty for variable selection in regression modeling.
Like all penalties though, its performance depends critically on the choice of the tuning parameter. One method for choosing the tuning parameter is via information criteria, such as those based on AIC and BIC. However, these criteria were developed for use with unpenalized maximum likelihood estimators, and it is not clear that they take into account the effects of penalization. In this article, we propose the extended regularized information criterion (ERIC) for choosing the tuning parameter in adaptive Lasso regression. ERIC extends the BIC to account for the effect of applying the adaptive Lasso on the bias-variance tradeoff. This leads to a criterion whose penalty for model complexity is itself a function of the tuning parameter. We show that the tuning parameter chosen by ERIC is selection consistent when the number of variables grows with sample size, and that this consistency holds in a wider range of contexts compared to using BIC to choose the tuning parameter. Simulations show that ERIC can significantly outperform BIC and other information criteria proposed for choosing the tuning parameter in selecting the true model. For ultrahigh-dimensional data (p > n), we consider a two-stage approach combining sure independence screening with adaptive Lasso regression using ERIC, which is selection consistent and performs strongly in simulation. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 262-269 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.951444 File-URL: http://hdl.handle.net/10.1080/01621459.2014.951444 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:262-269 Template-Type: ReDIF-Article 1.0 Author-Name: Wei Lin Author-X-Name-First: Wei Author-X-Name-Last: Lin Author-Name: Rui Feng Author-X-Name-First: Rui Author-X-Name-Last: Feng Author-Name: Hongzhe Li Author-X-Name-First: Hongzhe Author-X-Name-Last: Li Title: Regularization Methods for High-Dimensional Instrumental Variables Regression With an Application to Genetical Genomics Abstract: In genetical genomics studies, it is important to jointly analyze gene expression data and genetic variants in exploring their associations with complex traits, where the dimensionality of gene expressions and genetic variants can both be much larger than the sample size. Motivated by such modern applications, we consider the problem of variable selection and estimation in high-dimensional sparse instrumental variables models. To overcome the difficulty of high dimensionality and unknown optimal instruments, we propose a two-stage regularization framework for identifying and estimating important covariate effects while selecting and estimating optimal instruments. The methodology extends the classical two-stage least squares estimator to high dimensions by using sparsity-inducing penalty functions in both stages. The resulting procedure is efficiently implemented by coordinate descent optimization. For the representative L1 regularization and a class of concave regularization methods, we establish estimation, prediction, and model selection properties of the two-stage regularized estimators in the high-dimensional setting where the dimensionalities of covariates and instruments are both allowed to grow exponentially with the sample size.
The practical performance of the proposed method is evaluated by simulation studies, and its usefulness is illustrated by an analysis of mouse obesity data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 270-288 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.908125 File-URL: http://hdl.handle.net/10.1080/01621459.2014.908125 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:270-288 Template-Type: ReDIF-Article 1.0 Author-Name: Qiang Sun Author-X-Name-First: Qiang Author-X-Name-Last: Sun Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Author-Name: Yufeng Liu Author-X-Name-First: Yufeng Author-X-Name-Last: Liu Author-Name: Joseph G. Ibrahim Author-X-Name-First: Joseph G. Author-X-Name-Last: Ibrahim Title: SPReM: Sparse Projection Regression Model For High-Dimensional Linear Regression Abstract: The aim of this article is to develop a sparse projection regression modeling (SPReM) framework to perform multivariate regression modeling with a large number of responses and a multivariate covariate of interest. We propose two novel heritability ratios to simultaneously perform dimension reduction, response selection, estimation, and testing, while explicitly accounting for correlations among multivariate responses. Our SPReM is devised to specifically address the low statistical power issue of many standard statistical approaches, such as Hotelling's T^2 test statistic or a mass univariate analysis, for high-dimensional data. We formulate the estimation problem of SPReM as a novel sparse unit rank projection (SURP) problem and propose a fast optimization algorithm for SURP. Furthermore, we extend SURP to the sparse multirank projection (SMURP) by adopting a sequential SURP approximation. Theoretically, we have systematically investigated the convergence properties of SURP and the convergence rate of SURP estimates. Our simulation results and real data analysis have shown that SPReM outperforms other state-of-the-art methods. Journal: Journal of the American Statistical Association Pages: 289-302 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.892008 File-URL: http://hdl.handle.net/10.1080/01621459.2014.892008 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:289-302 Template-Type: ReDIF-Article 1.0 Author-Name: Juan Shen Author-X-Name-First: Juan Author-X-Name-Last: Shen Author-Name: Xuming He Author-X-Name-First: Xuming Author-X-Name-Last: He Title: Inference for Subgroup Analysis With a Structured Logistic-Normal Mixture Model Abstract: In this article, we propose a statistical model for the purpose of identifying a subgroup that has an enhanced treatment effect as well as the variables that are predictive of the subgroup membership. The need for such subgroup identification arises in clinical trials and in market segmentation analysis. By using a structured logistic-normal mixture model, our proposed framework enables us to perform a confirmatory statistical test for the existence of subgroups, and at the same time, to construct predictive scores for the subgroup membership.
The inferential procedure proposed in the article is built on the recent literature on hypothesis testing for Gaussian mixtures, but the structured logistic-normal mixture model enjoys some distinctive properties that are unavailable to the simpler Gaussian mixture models. With the bootstrap approximations, the proposed tests are shown to be powerful and, equally importantly, insensitive to the choice of tuning parameters. As an illustration, we analyze a dataset from the AIDS Clinical Trials Group 320 study and show how the proposed methodology can help detect a potential subgroup of AIDS patients who may react much more favorably to the addition of a protease inhibitor to a conventional regimen than other patients. Journal: Journal of the American Statistical Association Pages: 303-312 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.894763 File-URL: http://hdl.handle.net/10.1080/01621459.2014.894763 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:303-312 Template-Type: ReDIF-Article 1.0 Author-Name: Eben Kenah Author-X-Name-First: Eben Author-X-Name-Last: Kenah Title: Semiparametric Relative-Risk Regression for Infectious Disease Transmission Data Abstract: This article introduces semiparametric relative-risk regression models for infectious disease data. The units of analysis in these models are pairs of individuals at risk of transmission. The hazard of infectious contact from i to j consists of a baseline hazard multiplied by a relative risk function that can be a function of infectiousness covariates for i, susceptibility covariates for j, and pairwise covariates. When who-infects-whom is observed, we derive a profile likelihood maximized over all possible baseline hazard functions that is similar to the Cox partial likelihood. When who-infects-whom is not observed, we derive an EM algorithm to maximize the profile likelihood integrated over all possible combinations of who-infected-whom. This extends the most important class of regression models in survival analysis to infectious disease epidemiology. These methods can be implemented in standard statistical software, and they will be able to address important scientific questions about emerging infectious diseases with greater clarity, flexibility, and rigor than current statistical methods allow. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 313-325 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.896807 File-URL: http://hdl.handle.net/10.1080/01621459.2014.896807 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:313-325 Template-Type: ReDIF-Article 1.0 Author-Name: Dungang Liu Author-X-Name-First: Dungang Author-X-Name-Last: Liu Author-Name: Regina Y. Liu Author-X-Name-First: Regina Y. Author-X-Name-Last: Liu Author-Name: Minge Xie Author-X-Name-First: Minge Author-X-Name-Last: Xie Title: Multivariate Meta-Analysis of Heterogeneous Studies Using Only Summary Statistics: Efficiency and Robustness Abstract: Meta-analysis has been widely used to synthesize evidence from multiple studies for common hypotheses or parameters of interest. However, it has not yet been fully developed for incorporating heterogeneous studies, which arise often in applications due to different study designs, populations, or outcomes.
For heterogeneous studies, the parameter of interest may not be estimable for certain studies, and in such a case, these studies are typically excluded from conventional meta-analysis. The exclusion of part of the studies can lead to a nonnegligible loss of information. This article introduces a meta-analysis for heterogeneous studies by combining the confidence density functions derived from the summary statistics of individual studies, hence referred to as the CD approach. It includes all the studies in the analysis and makes use of all information, direct as well as indirect. Under a general likelihood inference framework, this new approach is shown to have several desirable properties, including: (i) it is asymptotically as efficient as the maximum likelihood approach using individual participant data (IPD) from all studies; (ii) unlike the IPD analysis, it suffices to use summary statistics to carry out the CD approach, and individual-level data are not required; and (iii) it is robust against misspecification of the working covariance structure of parameter estimates. Besides its own theoretical significance, the last property also substantially broadens the applicability of the CD approach. All the properties of the CD approach are further confirmed by data simulated from a randomized clinical trials setting as well as by real data on aircraft landing performance. Overall, one obtains a unifying approach for combining summary statistics, subsuming many of the existing meta-analysis methods as special cases. Journal: Journal of the American Statistical Association Pages: 326-340 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.899235 File-URL: http://hdl.handle.net/10.1080/01621459.2014.899235 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:326-340 Template-Type: ReDIF-Article 1.0 Author-Name: Shujie Ma Author-X-Name-First: Shujie Author-X-Name-Last: Ma Author-Name: Peter X.-K. Song Author-X-Name-First: Peter X.-K. Author-X-Name-Last: Song Title: Varying Index Coefficient Models Abstract: There is a long history of using interactions in regression analysis to investigate alterations in covariate effects on response variables. In this article, we aim to address two kinds of new challenges arising from the inclusion of such high-order effects in the regression model for complex data. The first kind concerns a situation where interaction effects of individual covariates are weak but those of combined covariates are strong, and the other kind pertains to the presence of nonlinear interactive effects directed by low-effect covariates. We propose a new class of semiparametric models with varying index coefficients, which enables us to model and assess nonlinear interaction effects between grouped covariates on the response variable. As a result, most of the existing semiparametric regression models are special cases of our proposed models. We develop a numerically stable and computationally fast estimation procedure using both the profile least squares method and local fitting. We establish both estimation consistency and asymptotic normality for the proposed estimators of index coefficients as well as the oracle property for the nonparametric function estimator. In addition, a generalized likelihood ratio test is provided to test for the existence of interaction effects or the existence of nonlinear interaction effects.
Our models and estimation methods are illustrated by simulation studies and by an analysis of child growth data to evaluate alterations in growth rates incurred by mothers' exposure to endocrine-disrupting compounds during pregnancy. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 341-356 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.903185 File-URL: http://hdl.handle.net/10.1080/01621459.2014.903185 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:341-356 Template-Type: ReDIF-Article 1.0 Author-Name: Jianhua Hu Author-X-Name-First: Jianhua Author-X-Name-Last: Hu Author-Name: Hongjian Zhu Author-X-Name-First: Hongjian Author-X-Name-Last: Zhu Author-Name: Feifang Hu Author-X-Name-First: Feifang Author-X-Name-Last: Hu Title: A Unified Family of Covariate-Adjusted Response-Adaptive Designs Based on Efficiency and Ethics Abstract: Response-adaptive designs have recently attracted more and more attention in the literature because of their advantages in efficiency and medical ethics. In the development of personalized medicine, covariate information plays an important role in both the design and analysis of clinical trials. A challenge is how to incorporate covariate information in response-adaptive designs while considering issues of both efficiency and medical ethics. To address this problem, we propose a new and unified family of covariate-adjusted response-adaptive (CARA) designs based on two general measurements of efficiency and ethics. Important properties (including asymptotic properties) of the proposed procedures are studied under categorical covariates. This new family of designs not only introduces new desirable CARA designs, but also unifies several important designs in the literature. We demonstrate the proposed procedures through examples, simulations, and a discussion of related earlier work. Journal: Journal of the American Statistical Association Pages: 357-367 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.903846 File-URL: http://hdl.handle.net/10.1080/01621459.2014.903846 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:357-367 Template-Type: ReDIF-Article 1.0 Author-Name: Jiejun Du Author-X-Name-First: Jiejun Author-X-Name-Last: Du Author-Name: Ian L. Dryden Author-X-Name-First: Ian L. Author-X-Name-Last: Dryden Author-Name: Xianzheng Huang Author-X-Name-First: Xianzheng Author-X-Name-Last: Huang Title: Size and Shape Analysis of Error-Prone Shape Data Abstract: We consider the problem of comparing sizes and shapes of objects when landmark data are prone to measurement error. We show that naive implementation of ordinary Procrustes analysis that ignores measurement error can compromise inference. To account for measurement error, we propose the conditional score method for matching configurations, which guarantees consistent inference under mild model assumptions. The effects of measurement error on inference from naive Procrustes analysis and the performance of the proposed method are illustrated via simulation and application in three real data examples. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 368-379 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.908779 File-URL: http://hdl.handle.net/10.1080/01621459.2014.908779 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:368-379 Template-Type: ReDIF-Article 1.0 Author-Name: Alexander Aue Author-X-Name-First: Alexander Author-X-Name-Last: Aue Author-Name: Diogo Dubart Norinho Author-X-Name-First: Diogo Dubart Author-X-Name-Last: Norinho Author-Name: Siegfried Hörmann Author-X-Name-First: Siegfried Author-X-Name-Last: Hörmann Title: On the Prediction of Stationary Functional Time Series Abstract: This article addresses the prediction of stationary functional time series. Existing contributions to this problem have largely focused on the special case of first-order functional autoregressive processes because of their technical tractability and the current lack of advanced functional time series methodology. It is shown here how standard multivariate prediction techniques can be used in this context. The connection between functional and multivariate predictions is made precise for the important case of vector and functional autoregressions. The proposed method is easy to implement, making use of existing statistical software packages, and may, therefore, be attractive to a broader, possibly nonacademic, audience. Its practical applicability is enhanced through the introduction of a novel functional final prediction error model selection criterion that allows for an automatic determination of the lag structure and the dimensionality of the model. The usefulness of the proposed methodology is demonstrated in a simulation study and an application to environmental data, namely the prediction of daily pollution curves describing the concentration of particulate matter in ambient air. It is found that the proposed prediction method often significantly outperforms existing methods. Journal: Journal of the American Statistical Association Pages: 378-392 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.909317 File-URL: http://hdl.handle.net/10.1080/01621459.2014.909317 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:378-392 Template-Type: ReDIF-Article 1.0 Author-Name: Jessica Minnier Author-X-Name-First: Jessica Author-X-Name-Last: Minnier Author-Name: Ming Yuan Author-X-Name-First: Ming Author-X-Name-Last: Yuan Author-Name: Jun S. Liu Author-X-Name-First: Jun S. Author-X-Name-Last: Liu Author-Name: Tianxi Cai Author-X-Name-First: Tianxi Author-X-Name-Last: Cai Title: Risk Classification With an Adaptive Naive Bayes Kernel Machine Model Abstract: Genetic studies of complex traits have uncovered only a small number of risk markers explaining a small fraction of heritability and adding little improvement to disease risk prediction. Standard single marker methods may lack power in selecting informative markers or estimating effects. Most existing methods also typically do not account for nonlinearity. Identifying markers with weak signals and estimating their joint effects among many noninformative markers remains challenging. One potential approach is to group markers based on biological knowledge such as gene structure. If markers in a group tend to have similar effects, proper usage of the group structure could improve power and efficiency in estimation. 
We propose a two-stage method relating markers to disease risk by taking advantage of known gene-set structures. Imposing a naive Bayes kernel machine (KM) model, we estimate gene-set specific risk models that relate each gene-set to the outcome in stage I. The KM framework efficiently models potentially nonlinear effects of predictors without requiring explicit specification of functional forms. In stage II, we aggregate information across gene-sets via a regularization procedure. Estimation and computational efficiency are further improved with kernel principal component analysis. Asymptotic results for model estimation and gene-set selection are derived, and numerical studies suggest that the proposed procedure could outperform existing procedures for constructing genetic risk models. Journal: Journal of the American Statistical Association Pages: 393-404 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.908778 File-URL: http://hdl.handle.net/10.1080/01621459.2014.908778 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:393-404 Template-Type: ReDIF-Article 1.0 Author-Name: Nadja Klein Author-X-Name-First: Nadja Author-X-Name-Last: Klein Author-Name: Thomas Kneib Author-X-Name-First: Thomas Author-X-Name-Last: Kneib Author-Name: Stefan Lang Author-X-Name-First: Stefan Author-X-Name-Last: Lang Title: Bayesian Generalized Additive Models for Location, Scale, and Shape for Zero-Inflated and Overdispersed Count Data Abstract: Frequent problems in applied research preventing the application of the classical Poisson log-linear model for analyzing count data include overdispersion, an excess of zeros compared to the Poisson distribution, correlated responses, and complex predictor structures comprising nonlinear effects of continuous covariates, interactions, or spatial effects. We propose a general class of Bayesian generalized additive models for zero-inflated and overdispersed count data within the framework of generalized additive models for location, scale, and shape where semiparametric predictors can be specified for several parameters of a count data distribution. As standard options for applied work we consider the zero-inflated Poisson, the negative binomial, and the zero-inflated negative binomial distributions. The additive predictor specifications rely on basis function approximations for the different types of effects in combination with Gaussian smoothness priors. We develop Bayesian inference based on Markov chain Monte Carlo simulation techniques where suitable proposal densities are constructed based on iteratively weighted least squares approximations to the full conditionals. To ensure practicability of the inference, we consider theoretical properties such as the involved question of whether the joint posterior is proper. The proposed approach is evaluated in simulation studies and applied to count data arising from patent citations and claim frequencies in car insurance. For the comparison of models with respect to the distribution, we consider quantile residuals as an effective graphical device and scoring rules that allow us to quantify the predictive ability of the models. The deviance information criterion is used to select appropriate predictor specifications once a response distribution has been chosen. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 405-419 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.912955 File-URL: http://hdl.handle.net/10.1080/01621459.2014.912955 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:405-419 Template-Type: ReDIF-Article 1.0 Author-Name: Efstathia Bura Author-X-Name-First: Efstathia Author-X-Name-Last: Bura Author-Name: Liliana Forzani Author-X-Name-First: Liliana Author-X-Name-Last: Forzani Title: Sufficient Reductions in Regressions With Elliptically Contoured Inverse Predictors Abstract: There are two general approaches based on inverse regression for estimating the linear sufficient reductions for the regression of Y on X: the moment-based approach such as SIR, PIR, SAVE, and DR, and the likelihood-based approach such as principal fitted components (PFC) and likelihood acquired directions (LAD) when the inverse predictors, X|Y, are normal. By construction, these methods extract information from the first two conditional moments of X|Y; they can only estimate linear reductions and thus form the linear sufficient dimension reduction (SDR) methodology. When var(X|Y) is constant, E(X|Y) contains the reduction and it can be estimated using PFC. When var(X|Y) is nonconstant, PFC misses the information in the variance, and second-moment-based methods (SAVE, DR, LAD) are used instead, resulting in efficiency loss in the estimation of the mean-based reduction. In this article we prove that (a) if X|Y is elliptically contoured with parameters and density gY, there is no linear nontrivial sufficient reduction except if gY is the normal density with constant variance; (b) for nonnormal elliptically contoured data, all existing linear SDR methods only estimate part of the reduction; (c) a sufficient reduction of X for the regression of Y on X comprises a linear and a nonlinear component. Journal: Journal of the American Statistical Association Pages: 420-434 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.914440 File-URL: http://hdl.handle.net/10.1080/01621459.2014.914440 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:420-434 Template-Type: ReDIF-Article 1.0 Author-Name: P. Richard Hahn Author-X-Name-First: P. Richard Author-X-Name-Last: Hahn Author-Name: Carlos M. Carvalho Author-X-Name-First: Carlos M. Author-X-Name-Last: Carvalho Title: Decoupling Shrinkage and Selection in Bayesian Linear Models: A Posterior Summary Perspective Abstract: Selecting a subset of variables for linear models remains an active area of research. This article reviews many of the recent contributions to the Bayesian model selection and shrinkage prior literature. A posterior variable selection summary is proposed, which distills a full posterior distribution over regression coefficients into a sequence of sparse linear predictors. Journal: Journal of the American Statistical Association Pages: 435-448 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2014.993077 File-URL: http://hdl.handle.net/10.1080/01621459.2014.993077 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:435-448 Template-Type: ReDIF-Article 1.0 Author-Name: Stephen E. Fienberg Author-X-Name-First: Stephen E. Author-X-Name-Last: Fienberg Author-Name: James S.
Hodges Author-X-Name-First: James S. Author-X-Name-Last: Hodges Author-Name: Liying Luo Author-X-Name-First: Liying Author-X-Name-Last: Luo Title: Letter to the Editor Journal: Journal of the American Statistical Association Pages: 457-457 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2015.1008100 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1008100 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:457a-457a Template-Type: ReDIF-Article 1.0 Author-Name: Y. Claire Yang Author-X-Name-First: Y. Claire Author-X-Name-Last: Yang Author-Name: Kenneth C. Land Author-X-Name-First: Kenneth C. Author-X-Name-Last: Land Title: Reply Journal: Journal of the American Statistical Association Pages: 457-457 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2015.1008843 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1008843 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:457b-457b Template-Type: ReDIF-Article 1.0 Author-Name: Wenjiang J. Fu Author-X-Name-First: Wenjiang J. Author-X-Name-Last: Fu Title: Reply Journal: Journal of the American Statistical Association Pages: 458-458 Issue: 509 Volume: 110 Year: 2015 Month: 3 X-DOI: 10.1080/01621459.2015.1008849 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1008849 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:458-458 Template-Type: ReDIF-Article 1.0 Author-Name: James G. Scott Author-X-Name-First: James G. Author-X-Name-Last: Scott Author-Name: Ryan C. Kelly Author-X-Name-First: Ryan C. Author-X-Name-Last: Kelly Author-Name: Matthew A. Smith Author-X-Name-First: Matthew A. Author-X-Name-Last: Smith Author-Name: Pengcheng Zhou Author-X-Name-First: Pengcheng Author-X-Name-Last: Zhou Author-Name: Robert E. Kass Author-X-Name-First: Robert E. Author-X-Name-Last: Kass Title: False Discovery Rate Regression: An Application to Neural Synchrony Detection in Primary Visual Cortex Abstract: This article introduces false discovery rate regression, a method for incorporating covariate information into large-scale multiple-testing problems. FDR regression estimates a relationship between test-level covariates and the prior probability that a given observation is a signal. It then uses this estimated relationship to inform the outcome of each test in a way that controls the overall false discovery rate at a prespecified level. This poses many subtle issues at the interface between inference and computation, and we investigate several variations of the overall approach. Simulation evidence suggests that: (1) when covariate effects are present, FDR regression improves power for a fixed false-discovery rate; and (2) when covariate effects are absent, the method is robust, in the sense that it does not lead to inflated error rates. We apply the method to neural recordings from primary visual cortex. The goal is to detect pairs of neurons that exhibit fine-time-scale interactions, in the sense that they fire together more often than expected by chance. Our method detects roughly 50% more synchronous pairs than a standard FDR-controlling analysis. The companion R package FDRreg implements all methods described in the article. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 459-471 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.990973 File-URL: http://hdl.handle.net/10.1080/01621459.2014.990973 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:459-471 Template-Type: ReDIF-Article 1.0 Author-Name: Xiangrong Kong Author-X-Name-First: Xiangrong Author-X-Name-Last: Kong Author-Name: Mei-Cheng Wang Author-X-Name-First: Mei-Cheng Author-X-Name-Last: Wang Author-Name: Ronald Gray Author-X-Name-First: Ronald Author-X-Name-Last: Gray Title: Analysis of Longitudinal Multivariate Outcome Data From Couples Cohort Studies: Application to HPV Transmission Dynamics Abstract: We consider a specific situation of correlated data where multiple outcomes are repeatedly measured on each member of a couple. Such multivariate longitudinal data from couples may exhibit multi-faceted correlations that can be further complicated if there are polygamous partnerships. An example is data from cohort studies on human papillomavirus (HPV) transmission dynamics in heterosexual couples. HPV is a common sexually transmitted disease with 14 known oncogenic types causing anogenital cancers. The binary outcomes on the multiple types measured in couples over time may introduce inter-type, intra-couple, and temporal correlations. Simple analysis using generalized estimating equations or random effects models lacks interpretability and cannot fully use the available information. We developed a hybrid modeling strategy using Markov transition models together with pairwise composite likelihood for analyzing such data. The method can be used to identify risk factors associated with HPV transmission and persistence, estimate difference in risks between male-to-female and female-to-male HPV transmission, compare type-specific transmission risks within couples, and characterize the inter-type and intra-couple associations. Applying the method to HPV couple data collected in a Ugandan male circumcision (MC) trial, we assessed the effect of MC and the role of gender on risks of HPV transmission and persistence. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 472-485 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.991394 File-URL: http://hdl.handle.net/10.1080/01621459.2014.991394 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:472-485 Template-Type: ReDIF-Article 1.0 Author-Name: Jie Li Author-X-Name-First: Jie Author-X-Name-Last: Li Author-Name: Yili Hong Author-X-Name-First: Yili Author-X-Name-Last: Hong Author-Name: Ram Thapa Author-X-Name-First: Ram Author-X-Name-Last: Thapa Author-Name: Harold E. Burkhart Author-X-Name-First: Harold E. Author-X-Name-Last: Burkhart Title: Survival Analysis of Loblolly Pine Trees With Spatially Correlated Random Effects Abstract: Loblolly pine, a native pine species of the southeastern United States, is the most-planted species for commercial timber. Predicting survival of loblolly pine following planting is of great interest to researchers in forestry science as it is closely related to the yield of timber. Data were collected from a region-wide thinning study, where permanent plots, located at 182 sites ranging from central Texas east to Florida and north to Delaware, were established in 1980-1981. 
One of the main objectives of this study was to investigate the relationship between the survival of loblolly pine trees and several important covariates such as age, thinning types, and physiographic regions, while adjusting for spatial correlation among different sites. We use a semiparametric proportional hazards model to describe the effects of covariates on the survival time, and incorporate the spatial random effects in the model to describe the spatial correlation among different sites. We apply the expectation-maximization (EM) algorithm to estimate the parameters in the model and conduct simulations to validate the estimation procedure. We also compare the proposed method with existing methods through simulations and discussions. Then we apply the developed method to the large-scale loblolly pine tree survival data and interpret the results. We conclude this article with discussions on the advantages of the proposed method, major findings of data analysis, and directions for future research. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 486-502 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.995793 File-URL: http://hdl.handle.net/10.1080/01621459.2014.995793 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:486-502 Template-Type: ReDIF-Article 1.0 Author-Name: Yanxun Xu Author-X-Name-First: Yanxun Author-X-Name-Last: Xu Author-Name: Peter Müller Author-X-Name-First: Peter Author-X-Name-Last: Müller Author-Name: Yuan Yuan Author-X-Name-First: Yuan Author-X-Name-Last: Yuan Author-Name: Kamalakar Gulukota Author-X-Name-First: Kamalakar Author-X-Name-Last: Gulukota Author-Name: Yuan Ji Author-X-Name-First: Yuan Author-X-Name-Last: Ji Title: MAD Bayes for Tumor Heterogeneity--Feature Allocation With Exponential Family Sampling Abstract: We propose small-variance asymptotic approximations for inference on tumor heterogeneity (TH) using next-generation sequencing data. Understanding TH is an important and open research problem in biology. The lack of appropriate statistical inference is a critical gap in existing methods that the proposed approach aims to fill. We build on a hierarchical model with an exponential family likelihood and a feature allocation prior. The proposed implementation of posterior inference generalizes similar small-variance approximations proposed by Kulis and Jordan, and by Broderick, Kulis, and Jordan, for inference with Dirichlet process mixture and Indian buffet process prior models under normal sampling. We show that the new algorithm can successfully recover latent structures of different haplotypes and subclones and is orders of magnitude faster than available Markov chain Monte Carlo samplers. The latter are practically infeasible for high-dimensional genomics data. The proposed approach is scalable, easy to implement, and benefits from the flexibility of Bayesian nonparametric models. More importantly, it provides a useful tool for applied scientists to estimate cell subtypes in tumor samples. R code is available at http://www.ma.utexas.edu/users/yxu/. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 503-514 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.995794 File-URL: http://hdl.handle.net/10.1080/01621459.2014.995794 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:503-514 Template-Type: ReDIF-Article 1.0 Author-Name: Samuel D. Pimentel Author-X-Name-First: Samuel D. Author-X-Name-Last: Pimentel Author-Name: Rachel R. Kelz Author-X-Name-First: Rachel R. Author-X-Name-Last: Kelz Author-Name: Jeffrey H. Silber Author-X-Name-First: Jeffrey H. Author-X-Name-Last: Silber Author-Name: Paul R. Rosenbaum Author-X-Name-First: Paul R. Author-X-Name-Last: Rosenbaum Title: Large, Sparse Optimal Matching With Refined Covariate Balance in an Observational Study of the Health Outcomes Produced by New Surgeons Abstract: Every newly trained surgeon performs her first unsupervised operation. How do the health outcomes of her patients compare with those of the patients of experienced surgeons? Using data from 498 hospitals, we compare 1252 pairs, each consisting of a new surgeon and an experienced surgeon working at the same hospital. We introduce a new form of matching that matches patients of each new surgeon to patients of an otherwise similar experienced surgeon at the same hospital, perfectly balancing 176 surgical procedures and closely balancing a total of 2.9 million categories of patients; additionally, the individual patient pairs are as close as possible. A new goal for matching is introduced, called "refined covariate balance," in which a sequence of nested, ever more refined, nominal covariates is balanced as closely as possible, emphasizing the first or coarsest covariate in that sequence. A new algorithm for matching is proposed, and the main new results prove that the algorithm finds the closest match in terms of the total within-pair covariate distances among all matches that achieve refined covariate balance. Unlike previous approaches to forcing balance on covariates, the new algorithm creates multiple paths to a match in a network, where paths that introduce imbalances are penalized and hence avoided to the extent possible. The algorithm exploits a sparse network to quickly optimize a match that is about two orders of magnitude larger than is typical in statistical matching problems, thereby permitting much more extensive use of fine and near-fine balance constraints. The match was constructed in a few minutes using a network optimization algorithm implemented in R. An R package called rcbalance implementing the method is available from CRAN. Journal: Journal of the American Statistical Association Pages: 515-527 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.997879 File-URL: http://hdl.handle.net/10.1080/01621459.2014.997879 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:515-527 Template-Type: ReDIF-Article 1.0 Author-Name: Hui Yao Author-X-Name-First: Hui Author-X-Name-Last: Yao Author-Name: Sungduk Kim Author-X-Name-First: Sungduk Author-X-Name-Last: Kim Author-Name: Ming-Hui Chen Author-X-Name-First: Ming-Hui Author-X-Name-Last: Chen Author-Name: Joseph G. Ibrahim Author-X-Name-First: Joseph G. Author-X-Name-Last: Ibrahim Author-Name: Arvind K. Shah Author-X-Name-First: Arvind K.
Author-X-Name-Last: Shah Author-Name: Jianxin Lin Author-X-Name-First: Jianxin Author-X-Name-Last: Lin Title: Bayesian Inference for Multivariate Meta-Regression With a Partially Observed Within-Study Sample Covariance Matrix Abstract: Multivariate meta-regression models are commonly used in settings where the response variable is naturally multidimensional. Such settings are common in cardiovascular and diabetes studies where the goal is to study cholesterol levels once a certain medication is given. In this setting, the natural multivariate endpoint is low density lipoprotein cholesterol (LDL-C), high density lipoprotein cholesterol (HDL-C), and triglycerides (TG). In this article, we examine study level (aggregate) multivariate meta-data from 26 Merck sponsored double-blind, randomized, active, or placebo-controlled clinical trials on adult patients with primary hypercholesterolemia. Our goal is to develop a methodology for carrying out Bayesian inference for multivariate meta-regression models with study level data when the within-study sample covariance matrix S for the multivariate response data is partially observed. Specifically, the proposed methodology is based on postulating a multivariate random effects regression model with an unknown within-study covariance matrix Σ in which we treat the within-study sample correlations as missing data, the standard deviations of the within-study sample covariance matrix S are assumed observed, and given Σ, S follows a Wishart distribution. Thus, we treat the off-diagonal elements of S as missing data, and these missing elements are sampled from the appropriate full conditional distribution in a Markov chain Monte Carlo (MCMC) sampling scheme via a novel transformation based on partial correlations. We further propose several structures (models) for Σ, which allow for borrowing strength across different treatment arms and trials. The proposed methodology is assessed using simulated as well as real data, and the results are shown to be quite promising. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 528-544 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2015.1006065 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1006065 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:528-544 Template-Type: ReDIF-Article 1.0 Author-Name: P. Z. Hadjipantelis Author-X-Name-First: P. Z. Author-X-Name-Last: Hadjipantelis Author-Name: J. A. D. Aston Author-X-Name-First: J. A. D. Author-X-Name-Last: Aston Author-Name: H. G. Müller Author-X-Name-First: H. G. Author-X-Name-Last: Müller Author-Name: J. P. Evans Author-X-Name-First: J. P. Author-X-Name-Last: Evans Title: Unifying Amplitude and Phase Analysis: A Compositional Data Approach to Functional Multivariate Mixed-Effects Modeling of Mandarin Chinese Abstract: Mandarin Chinese is a tonal language; the pitch (or F0) of its utterances carries considerable linguistic information. However, speech samples from different individuals are subject to changes in amplitude and phase, which must be accounted for in any analysis that attempts to provide a linguistically meaningful description of the language. A joint model for amplitude, phase, and duration is presented, which combines elements from functional data analysis, compositional data analysis, and linear mixed effects models.
By decomposing functions via functional principal component analysis and connecting registration functions to compositional data analysis, a joint multivariate mixed effects model can be formulated, which gives insights into the relationship between the different modes of variation as well as their dependence on linguistic and nonlinguistic covariates. The model is applied to the COSPRO-1 dataset, a comprehensive database of spoken Taiwanese Mandarin, containing approximately 50,000 phonetically diverse sample F0 contours (syllables), and reveals that phonetic information is jointly carried by both amplitude and phase variation. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 545-559 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2015.1006729 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1006729 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:545-559 Template-Type: ReDIF-Article 1.0 Author-Name: Ran Tao Author-X-Name-First: Ran Author-X-Name-Last: Tao Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: Nora Franceschini Author-X-Name-First: Nora Author-X-Name-Last: Franceschini Author-Name: Kari E. North Author-X-Name-First: Kari E. Author-X-Name-Last: North Author-Name: Eric Boerwinkle Author-X-Name-First: Eric Author-X-Name-Last: Boerwinkle Author-Name: Dan-Yu Lin Author-X-Name-First: Dan-Yu Author-X-Name-Last: Lin Title: Analysis of Sequence Data Under Multivariate Trait-Dependent Sampling Abstract: High-throughput DNA sequencing allows for the genotyping of common and rare variants for genetic association studies. At the present time and for the foreseeable future, it is not economically feasible to sequence all individuals in a large cohort. A cost-effective strategy is to sequence those individuals with extreme values of a quantitative trait. We consider the design under which the sampling depends on multiple quantitative traits. Under such trait-dependent sampling, standard linear regression analysis can result in bias of parameter estimation, inflation of Type I error, and loss of power. We construct a likelihood function that properly reflects the sampling mechanism and uses all available data. We implement a computationally efficient EM algorithm and establish the theoretical properties of the resulting maximum likelihood estimators. Our methods can be used to perform separate inference on each trait or simultaneous inference on multiple traits. We pay special attention to gene-level association tests for rare variants. We demonstrate the superiority of the proposed methods over standard linear regression through extensive simulation studies. We provide applications to the Cohorts for Heart and Aging Research in Genomic Epidemiology Targeted Sequencing Study and the National Heart, Lung, and Blood Institute Exome Sequencing Project. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 560-572 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2015.1008099 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1008099 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:560-572 Template-Type: ReDIF-Article 1.0 Author-Name: Bradley J. Barney Author-X-Name-First: Bradley J.
Author-X-Name-Last: Barney Author-Name: Federica Amici Author-X-Name-First: Federica Author-X-Name-Last: Amici Author-Name: Filippo Aureli Author-X-Name-First: Filippo Author-X-Name-Last: Aureli Author-Name: Josep Call Author-X-Name-First: Josep Author-X-Name-Last: Call Author-Name: Valen E. Johnson Author-X-Name-First: Valen E. Author-X-Name-Last: Johnson Title: Joint Bayesian Modeling of Binomial and Rank Data for Primate Cognition Abstract: In recent years, substantial effort has been devoted to methods for analyzing data containing mixed response types, but such techniques typically do not include rank data among the response types. Some unique challenges exist in analyzing rank data, particularly when ties are prevalent. We present techniques for jointly modeling binomial and rank data using Bayesian latent variable models. We apply these techniques to compare the cognitive abilities of nonhuman primates based on their performance on 17 cognitive tasks scored on either a rank or binomial scale. To jointly model the rank and binomial responses, we assume that responses are implicitly determined by latent cognitive abilities. We then model the latent variables using random effects models, with identifying restrictions chosen to promote parsimonious prior specification and model inferences. Results from the primate cognitive data are presented to illustrate the methodology. Our results indicate that the ordering of the cognitive abilities of species varies significantly across tasks, suggesting a partially independent evolution of cognitive abilities in primates. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 573-582 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2015.1016223 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1016223 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:573-582 Template-Type: ReDIF-Article 1.0 Author-Name: Ying-Qi Zhao Author-X-Name-First: Ying-Qi Author-X-Name-Last: Zhao Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: Eric B. Laber Author-X-Name-First: Eric B. Author-X-Name-Last: Laber Author-Name: Michael R. Kosorok Author-X-Name-First: Michael R. Author-X-Name-Last: Kosorok Title: New Statistical Learning Methods for Estimating Optimal Dynamic Treatment Regimes Abstract: Dynamic treatment regimes (DTRs) are sequential decision rules for individual patients that can adapt over time to an evolving illness. The goal is to accommodate heterogeneity among patients and find the DTR that will produce the best long-term outcome if implemented. We introduce two new statistical learning methods for estimating the optimal DTR, termed backward outcome weighted learning (BOWL) and simultaneous outcome weighted learning (SOWL). These approaches convert individualized treatment selection into either a sequential or a simultaneous classification problem, and can thus be applied by modifying existing machine learning techniques. The proposed methods are based on directly maximizing over all DTRs a nonparametric estimator of the expected long-term outcome; this is fundamentally different from regression-based methods, for example, Q-learning, which indirectly attempt such maximization and rely heavily on the correctness of postulated regression models.
We prove that the resulting rules are consistent and provide finite-sample bounds for the errors incurred by using the estimated rules. Simulation results suggest that the proposed methods produce superior DTRs compared with Q-learning, especially in small samples. We illustrate the methods using data from a clinical trial for smoking cessation. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 583-598 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.937488 File-URL: http://hdl.handle.net/10.1080/01621459.2014.937488 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:583-598 Template-Type: ReDIF-Article 1.0 Author-Name: R. Dennis Cook Author-X-Name-First: R. Dennis Author-X-Name-Last: Cook Author-Name: Xin Zhang Author-X-Name-First: Xin Author-X-Name-Last: Zhang Title: Foundations for Envelope Models and Methods Abstract: Envelopes were recently proposed by Cook, Li, and Chiaromonte as a method for reducing estimative and predictive variations in multivariate linear regression. We extend their formulation, proposing a general definition of an envelope and a general framework for adapting envelope methods to any estimation procedure. We apply the new envelope methods to weighted least squares, generalized linear models, and Cox regression. Simulations and illustrative data analysis show the potential for envelope methods to significantly improve standard methods in linear discriminant analysis, logistic regression, and Poisson regression. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 599-611 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.983235 File-URL: http://hdl.handle.net/10.1080/01621459.2014.983235 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:599-611 Template-Type: ReDIF-Article 1.0 Author-Name: C. F. Jeff Wu Author-X-Name-First: C. F. Jeff Author-X-Name-Last: Wu Title: Post-Fisherian Experimentation: From Physical to Virtual Abstract: Fisher's pioneering work in design of experiments has inspired further work with broader applications, especially in industrial experimentation. This article discusses three topics in physical experiments: principles of effect hierarchy, sparsity, and heredity for factorial designs; a new method called conditional main effect (CME) for de-aliasing aliased effects; and robust parameter design. I also review the recent emergence of virtual experiments on a computer. Some major challenges in computer experiments, which must go beyond Fisherian principles, are outlined. Journal: Journal of the American Statistical Association Pages: 612-620 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.914441 File-URL: http://hdl.handle.net/10.1080/01621459.2014.914441 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:612-620 Template-Type: ReDIF-Article 1.0 Author-Name: Sy Han Chiou Author-X-Name-First: Sy Han Author-X-Name-Last: Chiou Author-Name: Sangwook Kang Author-X-Name-First: Sangwook Author-X-Name-Last: Kang Author-Name: Jun Yan Author-X-Name-First: Jun Author-X-Name-Last: Yan Title: Semiparametric Accelerated Failure Time Modeling for Clustered Failure Times From Stratified Sampling Abstract: Clustered failure times often arise from studies with stratified sampling designs where it is desired to reduce both cost and sampling error. Semiparametric accelerated failure time (AFT) models have not been used as frequently as Cox relative risk models in such settings due to the lack of efficient and reliable computing routines for inference. The challenge is rooted in the nonsmoothness of the rank-based estimating functions, and, for clustered data, the asymptotic properties of the estimator from the weighted version have not been available. The recently proposed induced smoothing approach, which provides fast and accurate rank-based inferences for AFT models, is generalized to incorporate weights to accommodate stratified sampling designs. The estimator from the induced smoothing weighted estimating equations is shown to be consistent and to have the same asymptotic distribution as that from the nonsmooth version, a result that had not been established before. The variance of the estimator is estimated by computationally efficient sandwich estimators aided by a multiplier bootstrap. The proposed method is assessed in extensive simulation studies where the estimators appear to provide valid and efficient inferences. A stratified case-cohort design with clustered times to tooth extraction in a dental study illustrates the usefulness of the method. Journal: Journal of the American Statistical Association Pages: 621-629 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.917978 File-URL: http://hdl.handle.net/10.1080/01621459.2014.917978 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:621-629 Template-Type: ReDIF-Article 1.0 Author-Name: Hengjian Cui Author-X-Name-First: Hengjian Author-X-Name-Last: Cui Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Author-Name: Wei Zhong Author-X-Name-First: Wei Author-X-Name-Last: Zhong Title: Model-Free Feature Screening for Ultrahigh Dimensional Discriminant Analysis Abstract: This work is concerned with marginal sure independence feature screening for ultrahigh dimensional discriminant analysis. The response variable is categorical in discriminant analysis. This enables us to use the conditional distribution function to construct a new index for feature screening. In this article, we propose a marginal feature screening procedure based on the empirical conditional distribution function. We establish the sure screening and ranking consistency properties for the proposed procedure without assuming any moment condition on the predictors. The proposed procedure enjoys several appealing merits. First, it is model-free in that its implementation does not require specification of a regression model. Second, it is robust to heavy-tailed distributions of predictors and the presence of potential outliers. Third, it allows the categorical response to have a diverging number of classes, of order O(n^κ) for some κ ⩾ 0.
We assess the finite sample properties of the proposed procedure by Monte Carlo simulation studies and numerical comparison. We further illustrate the proposed methodology by empirical analyses of two real-life datasets. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 630-641 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.920256 File-URL: http://hdl.handle.net/10.1080/01621459.2014.920256 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:630-641 Template-Type: ReDIF-Article 1.0 Author-Name: Bo Jiang Author-X-Name-First: Bo Author-X-Name-Last: Jiang Author-Name: Chao Ye Author-X-Name-First: Chao Author-X-Name-Last: Ye Author-Name: Jun S. Liu Author-X-Name-First: Jun S. Author-X-Name-Last: Liu Title: Nonparametric K-Sample Tests via Dynamic Slicing Abstract: K-sample testing problems arise in many scientific applications and have attracted statisticians' attention for many years. We propose an omnibus nonparametric method based on an optimal discretization (aka "slicing") of continuous random variables in the test. The novelty of our approach lies in the inclusion of a term penalizing the number of slices (i.e., the resolution of the discretization) so as to regularize the corresponding likelihood-ratio test statistic. An efficient dynamic programming algorithm is developed to determine the optimal slicing scheme. Asymptotic and finite-sample properties such as power and null distribution of the resulting test statistic are studied. We compare the proposed testing method with some existing well-known methods and demonstrate its statistical power through extensive simulation studies as well as a real data example. A dynamic slicing method for the one-sample testing problem is further developed and studied under the same framework. Supplementary materials including technical derivations and proofs are available online. Journal: Journal of the American Statistical Association Pages: 642-653 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.920257 File-URL: http://hdl.handle.net/10.1080/01621459.2014.920257 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:642-653 Template-Type: ReDIF-Article 1.0 Author-Name: Philip Preuss Author-X-Name-First: Philip Author-X-Name-Last: Preuss Author-Name: Ruprecht Puchstein Author-X-Name-First: Ruprecht Author-X-Name-Last: Puchstein Author-Name: Holger Dette Author-X-Name-First: Holger Author-X-Name-Last: Dette Title: Detection of Multiple Structural Breaks in Multivariate Time Series Abstract: We propose a new nonparametric procedure (referred to as MuBreD) for the detection and estimation of multiple structural breaks in the autocovariance function of a multivariate (second-order) piecewise stationary process, which also identifies the components of the series where the breaks occur. MuBreD is based on a comparison of the estimated spectral distribution on different segments of the observed time series and consists of three steps: it starts with a consistent test, which allows us to prove the existence of structural breaks at a controlled Type I error rate.
Second, it estimates sets containing possible break points; finally, these sets are reduced to identify the relevant structural breaks and the corresponding components that are responsible for the changes in the autocovariance structure. In contrast to all other methods proposed in the literature, our approach does not make any parametric assumptions, is not designed solely for detecting a single change point, and addresses the problem of multiple structural breaks in the autocovariance function directly, without using the binary segmentation algorithm. We prove that the new procedure detects all components and the corresponding locations where structural breaks occur with probability converging to one as the sample size increases and provide data-driven rules for the selection of all regularization parameters. The results are illustrated by analyzing financial asset returns, and in a simulation study it is demonstrated that MuBreD outperforms the currently available nonparametric methods for detecting breaks in the dependency structure of multivariate time series. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 654-668 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.920613 File-URL: http://hdl.handle.net/10.1080/01621459.2014.920613 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:654-668 Template-Type: ReDIF-Article 1.0 Author-Name: Wei Ma Author-X-Name-First: Wei Author-X-Name-Last: Ma Author-Name: Feifang Hu Author-X-Name-First: Feifang Author-X-Name-Last: Hu Author-Name: Lixin Zhang Author-X-Name-First: Lixin Author-X-Name-Last: Zhang Title: Testing Hypotheses of Covariate-Adaptive Randomized Clinical Trials Abstract: Covariate-adaptive designs are often implemented to balance important covariates in clinical trials. However, the theoretical properties of conventional hypothesis tests are usually unknown under covariate-adaptive randomized clinical trials. In the literature, most studies are based on simulations. In this article, we provide a theoretical foundation for hypothesis testing under covariate-adaptive designs based on linear models. We derive the asymptotic distributions of the test statistics of testing both treatment effects and the significance of covariates under null and alternative hypotheses. Under a large class of covariate-adaptive designs, (i) hypothesis tests comparing treatment effects are usually conservative, with Type I error smaller than the nominal level; (ii) these tests are usually more powerful than under complete randomization; and (iii) tests for the significance of covariates remain valid. The class includes most of the covariate-adaptive designs in the literature, for example, Pocock and Simon's marginal procedure and the stratified permuted block design. Numerical studies are also performed to assess their corresponding finite sample properties. Supplementary material for this article is available online. Journal: Journal of the American Statistical Association Pages: 669-680 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.922469 File-URL: http://hdl.handle.net/10.1080/01621459.2014.922469 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:669-680 Template-Type: ReDIF-Article 1.0 Author-Name: Grace Y.
Yi Author-X-Name-First: Grace Y. Author-X-Name-Last: Yi Author-Name: Yanyuan Ma Author-X-Name-First: Yanyuan Author-X-Name-Last: Ma Author-Name: Donna Spiegelman Author-X-Name-First: Donna Author-X-Name-Last: Spiegelman Author-Name: Raymond J. Carroll Author-X-Name-First: Raymond J. Author-X-Name-Last: Carroll Title: Functional and Structural Methods With Mixed Measurement Error and Misclassification in Covariates Abstract: Covariate measurement imprecision or errors arise frequently in many areas. It is well known that ignoring such errors can substantially degrade the quality of inference or even yield erroneous results. Although in practice both covariates subject to measurement error and covariates subject to misclassification can occur, research attention in the literature has mainly focused on addressing these problems separately. To fill this gap, we develop estimation and inference methods that accommodate both characteristics simultaneously. Specifically, we consider measurement error and misclassification in generalized linear models under the scenario that an external validation study is available, and systematically develop a number of effective functional and structural methods. Our methods can be applied to different situations to meet various objectives. Journal: Journal of the American Statistical Association Pages: 681-696 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.922777 File-URL: http://hdl.handle.net/10.1080/01621459.2014.922777 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:681-696 Template-Type: ReDIF-Article 1.0 Author-Name: Catalina A. Vallejos Author-X-Name-First: Catalina A. Author-X-Name-Last: Vallejos Author-Name: Mark F. J. Steel Author-X-Name-First: Mark F. J. Author-X-Name-Last: Steel Title: Objective Bayesian Survival Analysis Using Shape Mixtures of Log-Normal Distributions Abstract: Survival models such as the Weibull or log-normal lead to inference that is not robust to the presence of outliers. They also assume that all heterogeneity between individuals can be modeled through covariates. This article considers the use of infinite mixtures of lifetime distributions as a solution for these two issues. This can be interpreted as the introduction of a random effect in the survival distribution. We introduce the family of shape mixtures of log-normal distributions, which covers a wide range of density and hazard functions. Bayesian inference under nonsubjective priors based on Jeffreys' rule is examined and conditions for posterior propriety are established. The existence of the posterior distribution on the basis of a sample of point observations is not always guaranteed and a solution through set observations is implemented. In addition, we propose a method for outlier detection based on the mixture structure. A simulation study illustrates the performance of our methods under different scenarios and an application to a real dataset is provided. Supplementary materials for the article, which include R code, are available online. Journal: Journal of the American Statistical Association Pages: 697-710 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.923316 File-URL: http://hdl.handle.net/10.1080/01621459.2014.923316 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:697-710 Template-Type: ReDIF-Article 1.0 Author-Name: Juhee Lee Author-X-Name-First: Juhee Author-X-Name-Last: Lee Author-Name: Peter F. Thall Author-X-Name-First: Peter F. Author-X-Name-Last: Thall Author-Name: Yuan Ji Author-X-Name-First: Yuan Author-X-Name-Last: Ji Author-Name: Peter Müller Author-X-Name-First: Peter Author-X-Name-Last: Müller Title: Bayesian Dose-Finding in Two Treatment Cycles Based on the Joint Utility of Efficacy and Toxicity Abstract: This article proposes a phase I/II clinical trial design for adaptively and dynamically optimizing each patient's dose in each of two cycles of therapy based on the joint binary efficacy and toxicity outcomes in each cycle. A dose-outcome model is assumed that includes a Bayesian hierarchical latent variable structure to induce association among the outcomes and also facilitate posterior computation. Doses are chosen in each cycle based on posteriors of a model-based objective function, similar to a reinforcement learning or Q-learning function, defined in terms of numerical utilities of the joint outcomes in each cycle. For each patient, the procedure outputs a sequence of two actions, one for each cycle, with each action being the decision to either treat the patient at a chosen dose or not to treat. The cycle 2 action depends on the individual patient's cycle 1 dose and outcomes. In addition, decisions are based on posterior inference using other patients' data, and therefore, the proposed method is adaptive both within and between patients. A simulation study of the method is presented, including comparison to two-cycle extensions of the conventional 3 + 3 algorithm, continual reassessment method, and a Bayesian model-based design, and evaluation of robustness. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 711-722 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.926815 File-URL: http://hdl.handle.net/10.1080/01621459.2014.926815 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:711-722 Template-Type: ReDIF-Article 1.0 Author-Name: Xuerong Chen Author-X-Name-First: Xuerong Author-X-Name-Last: Chen Author-Name: Alan T. K. Wan Author-X-Name-First: Alan T. K. Author-X-Name-Last: Wan Author-Name: Yong Zhou Author-X-Name-First: Yong Author-X-Name-Last: Zhou Title: Efficient Quantile Regression Analysis With Missing Observations Abstract: This article examines the problem of estimation in a quantile regression model when observations are missing at random under independent and nonidentically distributed errors. We consider three approaches to handling this problem, based on nonparametric inverse probability weighting, estimating equations projection, and a combination of both. An important distinguishing feature of our methods is their ability to handle missing responses and/or partially missing covariates, whereas existing techniques can handle only one or the other, but not both. We prove that our methods yield asymptotically equivalent estimators that achieve the desirable asymptotic properties of unbiasedness, normality, and root-n consistency. Because we do not assume that the errors are identically distributed, our theoretical results are valid under heteroscedasticity, a particularly strong feature of our methods.
Under the special case of identical error distributions, all of our proposed estimators achieve the semiparametric efficiency bound. To facilitate the practical implementation of these methods, we develop an iterative method based on the majorize/minimize algorithm for computing the quantile regression estimates, and a bootstrap method for computing their variances. Our simulation findings suggest that all three methods have good finite sample properties. We further illustrate these methods by a real data example. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 723-741 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.928219 File-URL: http://hdl.handle.net/10.1080/01621459.2014.928219 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:723-741 Template-Type: ReDIF-Article 1.0 Author-Name: Nikolaos Sgouropoulos Author-X-Name-First: Nikolaos Author-X-Name-Last: Sgouropoulos Author-Name: Qiwei Yao Author-X-Name-First: Qiwei Author-X-Name-Last: Yao Author-Name: Claudia Yastremiz Author-X-Name-First: Claudia Author-X-Name-Last: Yastremiz Title: Matching a Distribution by Matching Quantiles Estimation Abstract: Motivated by the problem of selecting representative portfolios for backtesting counterparty credit risks, we propose a matching quantiles estimation (MQE) method for matching a target distribution by that of a linear combination of a set of random variables. An iterative procedure based on ordinary least-squares (OLS) estimation is proposed to compute MQE. MQE can be easily modified by adding a LASSO penalty term if a sparse representation is desired, or by restricting the matching to a certain range of quantiles to match part of the target distribution. The convergence of the algorithm and the asymptotic properties of the estimator, both with and without LASSO, are established. A measure and an associated statistical test are proposed to assess the goodness-of-match. The finite sample properties are illustrated by simulation. An application in selecting a counterparty representative portfolio with a real dataset is reported. The proposed MQE also finds applications in portfolio tracking, which demonstrates the usefulness of combining MQE with LASSO. Journal: Journal of the American Statistical Association Pages: 742-759 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.929522 File-URL: http://hdl.handle.net/10.1080/01621459.2014.929522 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:742-759 Template-Type: ReDIF-Article 1.0 Author-Name: Randy C. S. Lai Author-X-Name-First: Randy C. S. Author-X-Name-Last: Lai Author-Name: Jan Hannig Author-X-Name-First: Jan Author-X-Name-Last: Hannig Author-Name: Thomas C. M. Lee Author-X-Name-First: Thomas C. M. Author-X-Name-Last: Lee Title: Generalized Fiducial Inference for Ultrahigh-Dimensional Regression Abstract: In recent years, the ultrahigh-dimensional linear regression problem has attracted enormous attention from the research community. Under the sparsity assumption, most of the published work is devoted to the selection and estimation of the predictor variables with nonzero coefficients. This article studies a different but fundamentally important aspect of this problem: uncertainty quantification for parameter estimates and model choices.
To be more specific, this article proposes methods for deriving a probability density function on the set of all possible models, and also for constructing confidence intervals for the corresponding parameters. These proposed methods are developed using the generalized fiducial methodology, which is a variant of Fisher's controversial fiducial idea. Theoretical properties of the proposed methods are studied, and in particular it is shown that statistical inference based on the proposed methods will have correct asymptotic frequentist properties. In terms of empirical performance, the proposed methods are tested by simulation experiments and an application to a real dataset. Finally, this work can also be seen as an interesting and successful application of Fisher's fiducial idea to an important and contemporary problem. To the best of the authors' knowledge, this is the first time that the fiducial idea has been applied to a so-called "large p small n" problem. A connection to objective Bayesian model selection is also discussed. Journal: Journal of the American Statistical Association Pages: 760-772 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.931237 File-URL: http://hdl.handle.net/10.1080/01621459.2014.931237 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:760-772 Template-Type: ReDIF-Article 1.0 Author-Name: Kuangyu Wen Author-X-Name-First: Kuangyu Author-X-Name-Last: Wen Author-Name: Ximing Wu Author-X-Name-First: Ximing Author-X-Name-Last: Wu Title: An Improved Transformation-Based Kernel Estimator of Densities on the Unit Interval Abstract: The kernel density estimator (KDE) suffers from boundary biases when applied to densities on a bounded support, which is assumed here to be the unit interval. Transformations mapping the unit interval to the real line can be used to remove boundary biases. However, this approach may induce erratic tail behaviors when the estimated density of transformed data is transformed back to its original scale. We propose a modified, transformation-based KDE that employs a tapered and tilted back-transformation. We derive the theoretical properties of the new estimator and show that it asymptotically dominates the naive transformation-based estimator while maintaining its simplicity. We then propose three automatic methods of smoothing parameter selection. Our Monte Carlo simulations demonstrate the good finite sample performance of the proposed estimator, especially for densities with poles near the boundaries. An example with real data is provided. Journal: Journal of the American Statistical Association Pages: 773-783 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.969426 File-URL: http://hdl.handle.net/10.1080/01621459.2014.969426 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:773-783 Template-Type: ReDIF-Article 1.0 Author-Name: Ke Zhu Author-X-Name-First: Ke Author-X-Name-Last: Zhu Author-Name: Shiqing Ling Author-X-Name-First: Shiqing Author-X-Name-Last: Ling Title: LADE-Based Inference for ARMA Models With Unspecified and Heavy-Tailed Heteroscedastic Noises Abstract: This article develops a systematic procedure of statistical inference for the auto-regressive moving average (ARMA) model with unspecified and heavy-tailed heteroscedastic noises.
We first investigate the least absolute deviation estimator (LADE) and the self-weighted LADE for the model. Both estimators are shown to be strongly consistent and asymptotically normal when the noise has a finite variance and infinite variance, respectively. The rate of convergence of both the LADE and the self-weighted LADE is n^(-1/2), which is faster than that of the least-squares estimator (LSE) for the ARMA model when the tail index of the generalized auto-regressive conditional heteroskedasticity (GARCH) noise is in (0, 4], and thus they are more efficient in this case. Since their asymptotic covariance matrices cannot be estimated directly from the sample, we develop the random weighting approach for statistical inference under this nonstandard case. We further propose a novel sign-based portmanteau test for model adequacy. A simulation study is carried out to assess the performance of our procedure, and a real illustrative example is given. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 784-794 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.977386 File-URL: http://hdl.handle.net/10.1080/01621459.2014.977386 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:784-794 Template-Type: ReDIF-Article 1.0 Author-Name: Li Ma Author-X-Name-First: Li Author-X-Name-Last: Ma Title: Scalable Bayesian Model Averaging Through Local Information Propagation Abstract: This article shows that a probabilistic version of the classical forward-stepwise variable inclusion procedure can serve as a general data-augmentation scheme for model space distributions in (generalized) linear models. This latent variable representation takes the form of a Markov process, thereby allowing information propagation algorithms to be applied for sampling from model space posteriors. In particular, we propose a sequential Monte Carlo method for achieving effective unbiased Bayesian model averaging in high-dimensional problems, using proposal distributions constructed via local information propagation. The method--called LIPS for local information propagation based sampling--is illustrated using real and simulated examples with dimensionality ranging from 15 to 1000, and its performance in estimating posterior inclusion probabilities and in out-of-sample prediction is compared to that of several other methods--namely, MCMC, BAS, iBMA, and LASSO. In addition, it is shown that the latent variable representation can also serve as a modeling tool for specifying model space priors that account for knowledge regarding model complexity and conditional inclusion relationships. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 795-809 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.980908 File-URL: http://hdl.handle.net/10.1080/01621459.2014.980908 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:795-809 Template-Type: ReDIF-Article 1.0 Author-Name: Harry Crane Author-X-Name-First: Harry Author-X-Name-Last: Crane Title: Clustering from Categorical Data Sequences Abstract: The three-parameter cluster model is a combinatorial stochastic process that generates categorical response sequences by randomly perturbing a fixed clustering parameter.
This clear relationship between the observed data and the underlying clustering is particularly attractive in cluster analysis, in which supervised learning is a common goal and missing data is a familiar issue. The model is well equipped for this task, as it can handle missing data, perform out-of-sample inference, and accommodate both independent and dependent data sequences. Moreover, its clustering parameter lies in the unrestricted space of partitions, so that the number of clusters need not be specified beforehand. We establish these and other theoretical properties and also demonstrate the model on datasets from epidemiology, genetics, political science, and legal studies. Journal: Journal of the American Statistical Association Pages: 810-823 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.983521 File-URL: http://hdl.handle.net/10.1080/01621459.2014.983521 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:810-823 Template-Type: ReDIF-Article 1.0 Author-Name: Peter Radchenko Author-X-Name-First: Peter Author-X-Name-Last: Radchenko Author-Name: Xinghao Qiao Author-X-Name-First: Xinghao Author-X-Name-Last: Qiao Author-Name: Gareth M. James Author-X-Name-First: Gareth M. Author-X-Name-Last: James Title: Index Models for Sparsely Sampled Functional Data Abstract: The regression problem involving functional predictors has many important applications and a number of functional regression methods have been developed. However, a common complication in functional data analysis is one of sparsely observed curves, that is, predictors that are observed, with error, on a small subset of the possible time points. Such sparsely observed data induce an errors-in-variables model, where one must account for measurement error in the functional predictors. Faced with sparsely observed data, most current functional regression methods simply estimate the unobserved predictors and treat them as fully observed, thus failing to account for the extra uncertainty from the measurement error. We propose a new functional errors-in-variables approach, sparse index model functional estimation (SIMFE), which uses a functional index model formulation to deal with sparsely observed predictors. SIMFE has several advantages over more traditional methods. First, the index model implements a nonlinear regression and uses an accurate supervised method to estimate the lower dimensional space into which the predictors should be projected. Second, SIMFE can be applied to both scalar and functional responses and multiple predictors. Finally, SIMFE uses a mixed effects model to effectively deal with very sparsely observed functional predictors and to correctly model the measurement error. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 824-836 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.931859 File-URL: http://hdl.handle.net/10.1080/01621459.2014.931859 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:824-836 Template-Type: ReDIF-Article 1.0 Author-Name: Karl Bruce Gregory Author-X-Name-First: Karl Bruce Author-X-Name-Last: Gregory Author-Name: Raymond J.
Author-X-Name-Last: Carroll Author-Name: Veerabhadran Baladandayuthapani Author-X-Name-First: Veerabhadran Author-X-Name-Last: Baladandayuthapani Author-Name: Soumendra N. Lahiri Author-X-Name-First: Soumendra N. Author-X-Name-Last: Lahiri Title: A Two-Sample Test for Equality of Means in High Dimension Abstract: We develop a test statistic for testing the equality of two population mean vectors in the "large-p-small-n" setting. Such a test must surmount the rank-deficiency of the sample covariance matrix, which causes the classic Hotelling T^2 test to break down. The proposed procedure, called the generalized component test, avoids full estimation of the covariance matrix by assuming that the p components admit a logical ordering such that the dependence between components is related to their displacement. The test is shown to be competitive with other recently developed methods under ARMA and long-range dependence structures and to achieve superior power for heavy-tailed data. The test does not assume equality of covariance matrices between the two populations, is robust to heteroscedasticity in the component variances, and requires very little computation time, which allows its use in settings with very large p. Analyses of mitochondrial calcium concentration in mouse cardiac muscles over time and of copy number variations in a glioblastoma multiforme dataset from The Cancer Genome Atlas are carried out to illustrate the test. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 837-849 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.934826 File-URL: http://hdl.handle.net/10.1080/01621459.2014.934826 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:837-849 Template-Type: ReDIF-Article 1.0 Author-Name: Yunxiao Chen Author-X-Name-First: Yunxiao Author-X-Name-Last: Chen Author-Name: Jingchen Liu Author-X-Name-First: Jingchen Author-X-Name-Last: Liu Author-Name: Gongjun Xu Author-X-Name-First: Gongjun Author-X-Name-Last: Xu Author-Name: Zhiliang Ying Author-X-Name-First: Zhiliang Author-X-Name-Last: Ying Title: Statistical Analysis of Q-Matrix Based Diagnostic Classification Models Abstract: Diagnostic classification models (DCMs) have recently gained prominence in educational assessment, psychiatric evaluation, and many other disciplines. Central to the model specification is the so-called Q-matrix that provides a qualitative specification of the item-attribute relationship. In this article, we develop identifiability theory for the Q-matrix under the DINA and the DINO models. We further propose an estimation procedure for the Q-matrix through regularized maximum likelihood. The applicability of this procedure is not limited to the DINA or the DINO model and it can be applied to essentially all Q-matrix based DCMs. Simulation studies show that the proposed method recovers the true Q-matrix with high probability. Furthermore, two case studies are presented. The first case is a dataset on fraction subtraction (educational application) and the second case is a subsample of the National Epidemiological Survey on Alcohol and Related Conditions concerning social anxiety disorder (psychiatric application).
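[Editor's illustrative aside, not part of the record above: for readers unfamiliar with Q-matrix based models, the following minimal Python sketch shows the deterministic rule that the DINA model attaches to a Q-matrix. All names and numbers are hypothetical; the article's contribution, identifiability and regularized estimation of Q itself, is not implemented here.]

```python
import numpy as np

# Minimal DINA sketch (our illustration, not the article's estimator).
# Q[j, k] = 1 if item j requires attribute k;
# alpha[i, k] = 1 if subject i has mastered attribute k.
def dina_ideal_responses(Q, alpha):
    # Subject i ideally answers item j correctly iff it masters every
    # attribute that item j requires: prod_k alpha[i, k]^Q[j, k].
    return (alpha @ Q.T >= Q.sum(axis=1)).astype(int)

def dina_response_probs(xi, slipping, guessing):
    # Observed correct-response probability: 1 - s_j when the ideal
    # response is 1, and g_j when it is 0.
    return np.where(xi == 1, 1.0 - slipping, guessing)

Q = np.array([[1, 0], [0, 1], [1, 1]])   # 3 items, 2 attributes (hypothetical)
alpha = np.array([[1, 0], [1, 1]])       # 2 subjects' attribute profiles
xi = dina_ideal_responses(Q, alpha)      # 2 x 3 ideal response matrix
print(dina_response_probs(xi, slipping=np.full(3, 0.1), guessing=np.full(3, 0.2)))
```

[Under the DINO model the "and" in the ideal-response rule becomes an "or"; the article studies when Q is identifiable from such response data and how to estimate it.]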
Journal: Journal of the American Statistical Association Pages: 850-866 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.934827 File-URL: http://hdl.handle.net/10.1080/01621459.2014.934827 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:850-866 Template-Type: ReDIF-Article 1.0 Author-Name: Shaoting Li Author-X-Name-First: Shaoting Author-X-Name-Last: Li Author-Name: Jiahua Chen Author-X-Name-First: Jiahua Author-X-Name-Last: Chen Author-Name: Jianhua Guo Author-X-Name-First: Jianhua Author-X-Name-Last: Guo Author-Name: Bing-Yi Jing Author-X-Name-First: Bing-Yi Author-X-Name-Last: Jing Author-Name: Shui-Ying Tsang Author-X-Name-First: Shui-Ying Author-X-Name-Last: Tsang Author-Name: Hong Xue Author-X-Name-First: Hong Author-X-Name-Last: Xue Title: Likelihood Ratio Test for Multi-Sample Mixture Model and Its Application to Genetic Imprinting Abstract: Genomic imprinting is a known aspect of the etiology of many diseases. The imprinting phenomenon refers to differential expression levels of an allele depending on its parental origin. When the parental origin is unknown, the expression level has a finite normal mixture distribution. In such applications, a random sample of expression levels consists of three subsamples according to the number of minor alleles an individual possesses, of which one is the mixture and the other two are homogeneous. This understanding leads to a likelihood ratio test (LRT) for the presence of imprinting. Because of the nonregularity of the finite mixture model, the classical asymptotic conclusions on likelihood-based inference are not applicable. We show that the maximum likelihood estimator of the mixing distribution remains consistent. More interestingly, thanks to the homogeneous subsamples, the LRT statistic has an elegant and rather distinct 0.5χ₁² + 0.5χ₂² null limiting distribution. Simulation studies confirm that the limiting distribution provides precise approximations of the finite sample distributions under various parameter settings. The LRT is applied to expression data. Our analyses provide evidence for imprinting for a number of isoform expressions. Journal: Journal of the American Statistical Association Pages: 867-877 Issue: 510 Volume: 110 Year: 2015 Month: 6 X-DOI: 10.1080/01621459.2014.939272 File-URL: http://hdl.handle.net/10.1080/01621459.2014.939272 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:867-877 Template-Type: ReDIF-Article 1.0 Author-Name: Mariel M. Finucane Author-X-Name-First: Mariel M. Author-X-Name-Last: Finucane Author-Name: Christopher J. Paciorek Author-X-Name-First: Christopher J. Author-X-Name-Last: Paciorek Author-Name: Gretchen A. Stevens Author-X-Name-First: Gretchen A. Author-X-Name-Last: Stevens Author-Name: Majid Ezzati Author-X-Name-First: Majid Author-X-Name-Last: Ezzati Title: Semiparametric Bayesian Density Estimation With Disparate Data Sources: A Meta-Analysis of Global Childhood Undernutrition Abstract: Undernutrition, which results in restricted growth and is quantified here using height-for-age z-scores, is an important contributor to childhood morbidity and mortality. Since all levels of mild, moderate, and severe undernutrition are of clinical and public health importance, it is of interest to estimate the shape of the z-scores' distributions.
We present a finite normal mixture model that uses data on 4.3 million children to make annual country-specific estimates of these distributions for under-5-year-old children in the world's 141 low- and middle-income countries between 1985 and 2011. We incorporate individual-level data when available, as well as aggregated summary statistics from studies whose individual-level data could not be obtained. We place a hierarchical Bayesian probit stick-breaking model on the mixture weights. The model allows for nonlinear changes in time, and it borrows strength in time, in covariates, and within and across regional country clusters to make estimates where data are uncertain, sparse, or missing. This work addresses three important problems that often arise in the fields of public health surveillance and global health monitoring. First, data are always incomplete. Second, different data sources commonly use different reporting metrics. Last, distributions, and especially their tails, are often of substantive interest. Journal: Journal of the American Statistical Association Pages: 889-901 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2014.937487 File-URL: http://hdl.handle.net/10.1080/01621459.2014.937487 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:889-901 Template-Type: ReDIF-Article 1.0 Author-Name: Christopher K. Wikle Author-X-Name-First: Christopher K. Author-X-Name-Last: Wikle Author-Name: Scott H. Holan Author-X-Name-First: Scott H. Author-X-Name-Last: Holan Title: Comment Journal: Journal of the American Statistical Association Pages: 901-903 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2015.1073083 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1073083 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:901-903 Template-Type: ReDIF-Article 1.0 Author-Name: Jim Hodges Author-X-Name-First: Jim Author-X-Name-Last: Hodges Title: Comment Journal: Journal of the American Statistical Association Pages: 903-905 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2015.1073084 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1073084 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:903-905 Template-Type: ReDIF-Article 1.0 Author-Name: Mariel M. Finucane Author-X-Name-First: Mariel M. Author-X-Name-Last: Finucane Author-Name: Christopher J. Paciorek Author-X-Name-First: Christopher J. Author-X-Name-Last: Paciorek Author-Name: Gretchen A. Stevens Author-X-Name-First: Gretchen A. Author-X-Name-Last: Stevens Author-Name: Majid Ezzati Author-X-Name-First: Majid Author-X-Name-Last: Ezzati Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 906-909 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2015.1073085 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1073085 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:906-909 Template-Type: ReDIF-Article 1.0 Author-Name: José R.
Author-X-Name-Last: Zubizarreta Title: Stable Weights that Balance Covariates for Estimation With Incomplete Outcome Data Abstract: Weighting methods that adjust for observed covariates, such as inverse probability weighting, are widely used for causal inference and estimation with incomplete outcome data. Part of the appeal of such methods is that one set of weights can be used to estimate a range of treatment effects based on different outcomes, or a variety of population means for several variables. However, this appeal can be diminished in practice by the instability of the estimated weights and by the difficulty of adequately adjusting for observed covariates in some settings. To address these limitations, this article presents a new weighting method that finds the weights of minimum variance that adjust or balance the empirical distribution of the observed covariates up to levels prespecified by the researcher. This method allows the researcher to balance very precisely the means of the observed covariates and other features of their marginal and joint distributions, such as variances and correlations, and also, for example, the quantiles of interactions of pairs and triples of observed covariates, thus balancing entire two- and three-way marginals. Since the weighting method is based on a well-defined convex optimization problem, duality theory provides insight into the behavior of the variance of the optimal weights in relation to the level of covariate balance adjustment, answering the question: how much does tightening a balance constraint increase the variance of the weights? Also, the weighting method runs in polynomial time, so relatively large datasets can be handled quickly. An implementation of the method is provided in the new package sbw for R. This article shows some theoretical properties of the resulting weights and illustrates their use by analyzing both a dataset from the 2010 Chilean earthquake and a simulated example. Journal: Journal of the American Statistical Association Pages: 910-922 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2015.1023805 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1023805 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:910-922 Template-Type: ReDIF-Article 1.0 Author-Name: Beom Seuk Hwang Author-X-Name-First: Beom Seuk Author-X-Name-Last: Hwang Author-Name: Zhen Chen Author-X-Name-First: Zhen Author-X-Name-Last: Chen Title: An Integrated Bayesian Nonparametric Approach for Stochastic and Variability Orders in ROC Curve Estimation: An Application to Endometriosis Diagnosis Abstract: In estimating ROC curves of multiple tests, some a priori constraints may exist, either between the healthy and diseased populations within a test or between tests within a population. In this article, we propose an integrated modeling approach for ROC curves that jointly accounts for stochastic and variability orders. The stochastic order constrains the distributional centers of the diseased and healthy populations within a test, while the variability order constrains the distributional spreads of the tests within each of the populations. Under a Bayesian nonparametric framework, we used features of the Dirichlet process mixture to incorporate these order constraints in a natural way.
We applied the proposed approach to data from the Physician Reliability Study that investigated the accuracy of diagnosing endometriosis using different clinical information. To address the issue of no gold standard in the real data, we used a sensitivity analysis approach that exploited diagnosis from a panel of experts. To demonstrate the performance of the methodology, we conducted simulation studies with varying sample sizes, distributional assumptions, and order constraints. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 923-934 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2015.1023806 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1023806 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:923-934 Template-Type: ReDIF-Article 1.0 Author-Name: Charles Hokayem Author-X-Name-First: Charles Author-X-Name-Last: Hokayem Author-Name: Christopher Bollinger Author-X-Name-First: Christopher Author-X-Name-Last: Bollinger Author-Name: James P. Ziliak Author-X-Name-First: James P. Author-X-Name-Last: Ziliak Title: The Role of CPS Nonresponse in the Measurement of Poverty Abstract: The Current Population Survey Annual Social and Economic Supplement (CPS ASEC) serves as the data source for official income, poverty, and inequality statistics in the United States. There is a concern that the rise in nonresponse to earnings questions could degrade data quality and distort estimates of these important metrics. We use a dataset of internal ASEC records matched to Social Security Detailed Earnings Records (DER) to study the impact of earnings nonresponse on estimates of poverty from 1997 to 2008. Our analysis does not treat the administrative data as the "truth"; instead, we rely on information from both administrative and survey data. We compare a "full response" poverty rate that assumes all ASEC respondents provided earnings data to the official poverty rate to gauge the nonresponse bias. On average, we find the nonresponse bias is about 1.0 percentage point. Journal: Journal of the American Statistical Association Pages: 935-945 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2015.1029576 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1029576 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:935-945 Template-Type: ReDIF-Article 1.0 Author-Name: Chao Huang Author-X-Name-First: Chao Author-X-Name-Last: Huang Author-Name: Martin Styner Author-X-Name-First: Martin Author-X-Name-Last: Styner Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Title: Clustering High-Dimensional Landmark-Based Two-Dimensional Shape Data Abstract: An important goal in image analysis is to cluster and recognize objects of interest according to the shapes of their boundaries. Clustering such objects faces at least four major challenges, including a curved shape space, a high-dimensional feature space, a complex spatial correlation structure, and shape variation associated with some covariates (e.g., age or gender). The aim of this article is to develop a penalized model-based clustering framework to cluster landmark-based planar shape data, while explicitly addressing these challenges.
Specifically, a mixture of offset-normal shape factor analyzers (MOSFA) is proposed with mixing proportions defined through a regression model (e.g., logistic) and an offset-normal shape distribution in each component for data in the curved shape space. A latent factor analysis model is introduced to explicitly model the complex spatial correlation. A penalized likelihood approach with both an adaptive pairwise fused Lasso penalty function and an L2 penalty function is used to automatically realize variable selection via thresholding and deliver a sparse solution. Our real data analysis has confirmed the excellent finite-sample performance of MOSFA in revealing meaningful clusters in the corpus callosum shape data obtained from the Attention Deficit Hyperactivity Disorder-200 (ADHD-200) study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 946-961 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2015.1034802 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1034802 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:946-961 Template-Type: ReDIF-Article 1.0 Author-Name: Yi-Juan Hu Author-X-Name-First: Yi-Juan Author-X-Name-Last: Hu Author-Name: Wei Sun Author-X-Name-First: Wei Author-X-Name-Last: Sun Author-Name: Jung-Ying Tzeng Author-X-Name-First: Jung-Ying Author-X-Name-Last: Tzeng Author-Name: Charles M. Perou Author-X-Name-First: Charles M. Author-X-Name-Last: Perou Title: Proper Use of Allele-Specific Expression Improves Statistical Power for cis-eQTL Mapping with RNA-Seq Data Abstract: Studies of expression quantitative trait loci (eQTLs) offer insight into the molecular mechanisms of loci that were found to be associated with complex diseases and the mechanisms can be classified into cis- and trans-acting regulation. At present, high-throughput RNA sequencing (RNA-seq) is rapidly replacing expression microarrays to assess gene expression abundance. Unlike microarrays that only measure the total expression of each gene, RNA-seq also provides information on allele-specific expression (ASE), which can be used to distinguish cis-eQTLs from trans-eQTLs and, more importantly, enhance cis-eQTL mapping. However, assessing the cis-effect of a candidate eQTL on a gene requires knowledge of the haplotypes connecting the candidate eQTL and the gene, which cannot be inferred with certainty. The existing two-stage approach that first phases the candidate eQTL against the gene and then treats the inferred phase as observed in the association analysis tends to attenuate the estimated cis-effect and reduce the power for detecting a cis-eQTL. In this article, we provide a maximum-likelihood framework for cis-eQTL mapping with RNA-seq data. Our approach integrates the inference of haplotypes and the association analysis into a single stage, and is thus unbiased and statistically powerful. We also develop a pipeline for performing a comprehensive scan of all local eQTLs for all genes in the genome by controlling for false discovery rate, and implement the methods in a computationally efficient software program. The advantages of the proposed methods over the existing ones are demonstrated through realistic simulation studies and an application to empirical breast cancer data from The Cancer Genome Atlas project. Supplementary materials for this article are available online.
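[Editor's illustrative aside, not part of the record above: to convey why ASE read counts are informative for cis-regulation, here is a deliberately simplified sketch. It is our illustration, not the authors' maximum-likelihood framework, and it treats the phase as known, which is exactly the simplification the article avoids. All counts are hypothetical.]

```python
from scipy.stats import binomtest

# Toy allelic-imbalance check for one individual heterozygous at a candidate
# cis-eQTL: a cis-acting variant should shift allele-specific read counts
# away from 50/50 between the two haplotypes. Counts below are hypothetical.
reads_hap_with_variant = 73   # reads assigned to the haplotype carrying the variant allele
reads_other_hap = 41          # reads assigned to the other haplotype

n = reads_hap_with_variant + reads_other_hap
result = binomtest(reads_hap_with_variant, n, p=0.5, alternative="two-sided")
print(f"allelic fraction = {reads_hap_with_variant / n:.3f}, p-value = {result.pvalue:.4f}")
```

[The article's point is that the phase is in fact uncertain, so plugging in an inferred phase, as this toy test does, attenuates the estimated cis-effect; its single-stage likelihood integrates over the haplotype uncertainty instead.]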
Journal: Journal of the American Statistical Association Pages: 962-974 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2015.1038449 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1038449 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:962-974 Template-Type: ReDIF-Article 1.0 Author-Name: Wei Sun Author-X-Name-First: Wei Author-X-Name-Last: Sun Author-Name: Yufeng Liu Author-X-Name-First: Yufeng Author-X-Name-Last: Liu Author-Name: James J. Crowley Author-X-Name-First: James J. Author-X-Name-Last: Crowley Author-Name: Ting-Huei Chen Author-X-Name-First: Ting-Huei Author-X-Name-Last: Chen Author-Name: Hua Zhou Author-X-Name-First: Hua Author-X-Name-Last: Zhou Author-Name: Haitao Chu Author-X-Name-First: Haitao Author-X-Name-Last: Chu Author-Name: Shunping Huang Author-X-Name-First: Shunping Author-X-Name-Last: Huang Author-Name: Pei-Fen Kuan Author-X-Name-First: Pei-Fen Author-X-Name-Last: Kuan Author-Name: Yuan Li Author-X-Name-First: Yuan Author-X-Name-Last: Li Author-Name: Darla Miller Author-X-Name-First: Darla Author-X-Name-Last: Miller Author-Name: Ginger Shaw Author-X-Name-First: Ginger Author-X-Name-Last: Shaw Author-Name: Yichao Wu Author-X-Name-First: Yichao Author-X-Name-Last: Wu Author-Name: Vasyl Zhabotynsky Author-X-Name-First: Vasyl Author-X-Name-Last: Zhabotynsky Author-Name: Leonard McMillan Author-X-Name-First: Leonard Author-X-Name-Last: McMillan Author-Name: Fei Zou Author-X-Name-First: Fei Author-X-Name-Last: Zou Author-Name: Patrick F. Sullivan Author-X-Name-First: Patrick F. Author-X-Name-Last: Sullivan Author-Name: Fernando Pardo-Manuel De Villena Author-X-Name-First: Fernando Pardo-Manuel Author-X-Name-Last: De Villena Title: IsoDOT Detects Differential RNA-Isoform Expression/Usage With Respect to a Categorical or Continuous Covariate With High Sensitivity and Specificity Abstract: We have developed a statistical method named IsoDOT to assess differential isoform expression (DIE) and differential isoform usage (DIU) using RNA-seq data. Here isoform usage refers to relative isoform expression given the total expression of the corresponding gene. IsoDOT performs two tasks that cannot be accomplished by existing methods: to test DIE/DIU with respect to a continuous covariate, and to test DIE/DIU for one case versus one control. The latter task is not an uncommon situation in practice, for example, comparing the paternal and maternal alleles of one individual or comparing tumor and normal samples of one cancer patient. Simulation studies demonstrate the high sensitivity and specificity of IsoDOT. We apply IsoDOT to study the effects of haloperidol treatment on the mouse transcriptome and identify a group of genes whose isoform usages respond to haloperidol treatment. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 975-986 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2015.1040880 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1040880 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:975-986 Template-Type: ReDIF-Article 1.0 Author-Name: Hang J. Kim Author-X-Name-First: Hang J. Author-X-Name-Last: Kim Author-Name: Lawrence H. Cox Author-X-Name-First: Lawrence H. Author-X-Name-Last: Cox Author-Name: Alan F. Karr Author-X-Name-First: Alan F. 
Author-X-Name-Last: Karr Author-Name: Jerome P. Reiter Author-X-Name-First: Jerome P. Author-X-Name-Last: Reiter Author-Name: Quanli Wang Author-X-Name-First: Quanli Author-X-Name-Last: Wang Title: Simultaneous Edit-Imputation for Continuous Microdata Abstract: Many statistical organizations collect data that are expected to satisfy linear constraints; as examples, component variables should sum to total variables, and ratios of pairs of variables should be bounded by expert-specified constants. When reported data violate constraints, organizations identify and replace values potentially in error in a process known as edit-imputation. To date, most approaches separate the error localization and imputation steps, typically using optimization methods to identify the variables to change followed by hot deck imputation. We present an approach that fully integrates editing and imputation for continuous microdata under linear constraints. Our approach relies on a Bayesian hierarchical model that includes (i) a flexible joint probability model for the underlying true values of the data with support only on the set of values that satisfy all editing constraints, (ii) a model for latent indicators of the variables that are in error, and (iii) a model for the reported responses for variables in error. We illustrate the potential advantages of the Bayesian editing approach over existing approaches using simulation studies. We apply the model to edit faulty data from the 2007 U.S. Census of Manufactures. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 987-999 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2015.1040881 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1040881 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:987-999 Template-Type: ReDIF-Article 1.0 Author-Name: Li-Chu Chien Author-X-Name-First: Li-Chu Author-X-Name-Last: Chien Author-Name: Yuh-Jenn Wu Author-X-Name-First: Yuh-Jenn Author-X-Name-Last: Wu Author-Name: Chao A. Hsiung Author-X-Name-First: Chao A. Author-X-Name-Last: Hsiung Author-Name: Lu-Hai Wang Author-X-Name-First: Lu-Hai Author-X-Name-Last: Wang Author-Name: I-Shou Chang Author-X-Name-First: I-Shou Author-X-Name-Last: Chang Title: Smoothed Lexis Diagrams With Applications to Lung and Breast Cancer Trends in Taiwan Abstract: Cancer surveillance research often begins with a rate matrix, also called a Lexis diagram, of cancer incidence derived from cancer registry and census data. Lexis diagrams with 3- or 5-year intervals for age group and for calendar year of diagnosis are often considered. This simple smoothing approach suffers from a significant limitation: important details useful in studying time trends may be lost in the averaging process involved in generating a summary rate. This article constructs a smoothed Lexis diagram and indicates its use in cancer surveillance research. Specifically, we use a Poisson model to describe the relationship between the number of new cases, the number of people at risk, and a smoothly varying incidence rate for the study of the incidence rate function.
Based on the Poisson model, we use the standard Lexis diagram to introduce priors through the coefficients of Bernstein polynomials and propose a Bayesian approach to construct a smoothed Lexis diagram for the study of the effects of age, period, and cohort on incidence rates in terms of straightforward graphical displays. These include the age-specific rates by year of birth, age-specific rates by year of diagnosis, year-specific rates by age of diagnosis, and cohort-specific rates by age of diagnosis. We illustrate our approach by studying the trends in lung and breast cancer incidence in Taiwan. We find that for nearly every age group the incidence rates for lung adenocarcinoma and female invasive breast cancer increased rapidly in the past two decades and those for male lung squamous cell carcinoma started to decrease, which is consistent with the decline in the male smoking rate that began in 1985. Since the analyses indicate strong age, period, and cohort effects, it seems that both lung cancer and breast cancer will become more important public health problems in Taiwan. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1000-1012 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2015.1042106 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1042106 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1000-1012 Template-Type: ReDIF-Article 1.0 Author-Name: Kosuke Imai Author-X-Name-First: Kosuke Author-X-Name-Last: Imai Author-Name: Marc Ratkovic Author-X-Name-First: Marc Author-X-Name-Last: Ratkovic Title: Robust Estimation of Inverse Probability Weights for Marginal Structural Models Abstract: Marginal structural models (MSMs) are becoming increasingly popular as a tool for causal inference from longitudinal data. Unlike standard regression models, MSMs can adjust for time-dependent observed confounders while avoiding the bias due to the direct adjustment for covariates affected by the treatment. Despite their theoretical appeal, a main practical difficulty of MSMs is the required estimation of inverse probability weights. Previous studies have found that MSMs can be highly sensitive to misspecification of the treatment assignment model even when the number of time periods is moderate. To address this problem, we generalize the covariate balancing propensity score (CBPS) methodology of Imai and Ratkovic to longitudinal analysis settings. The CBPS estimates the inverse probability weights such that the resulting covariate balance is improved. Unlike the standard approach, the proposed methodology incorporates all covariate balancing conditions across multiple time periods. Since the number of these conditions grows exponentially as the number of time periods increases, we also propose a low-rank approximation to ease the computational burden. Our simulation and empirical studies suggest that the CBPS significantly improves the empirical performance of MSMs by making the treatment assignment model more robust to misspecification. Open-source software is available for implementing the proposed methods. Journal: Journal of the American Statistical Association Pages: 1013-1023 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2014.956872 File-URL: http://hdl.handle.net/10.1080/01621459.2014.956872 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1013-1023 Template-Type: ReDIF-Article 1.0 Author-Name: Karel Vermeulen Author-X-Name-First: Karel Author-X-Name-Last: Vermeulen Author-Name: Stijn Vansteelandt Author-X-Name-First: Stijn Author-X-Name-Last: Vansteelandt Title: Bias-Reduced Doubly Robust Estimation Abstract: Over the past decade, doubly robust estimators have been proposed for a variety of target parameters in causal inference and missing data models. These are asymptotically unbiased when at least one of two nuisance working models is correctly specified, regardless of which. While their asymptotic distribution is not affected by the choice of root-n consistent estimators of the nuisance parameters indexing these working models when all working models are correctly specified, this choice of estimators can have a dramatic impact under misspecification of at least one working model. In this article, we will therefore propose a simple and generic estimation principle for the nuisance parameters indexing each of the working models, which is designed to improve the performance of the doubly robust estimator of interest, relative to the default use of maximum likelihood estimators for the nuisance parameters. The proposed approach locally minimizes the squared first-order asymptotic bias of the doubly robust estimator under misspecification of both working models and results in doubly robust estimators with easy-to-calculate asymptotic variance. It moreover improves the stability of the weights in those doubly robust estimators which invoke inverse probability weighting. Simulation studies confirm the desirable finite-sample performance of the proposed estimators. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1024-1036 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2014.958155 File-URL: http://hdl.handle.net/10.1080/01621459.2014.958155 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1024-1036 Template-Type: ReDIF-Article 1.0 Author-Name: Alexander Volfovsky Author-X-Name-First: Alexander Author-X-Name-Last: Volfovsky Author-Name: Peter D. Hoff Author-X-Name-First: Peter D. Author-X-Name-Last: Hoff Title: Testing for Nodal Dependence in Relational Data Matrices Abstract: Relational data are often represented as a square matrix, the entries of which record the relationships between pairs of objects. Many statistical methods for the analysis of such data assume some degree of similarity or dependence between objects in terms of the way they relate to each other. However, formal tests for such dependence have not been developed. We provide a test for such dependence using the framework of the matrix normal model, a type of multivariate normal distribution parameterized in terms of row- and column-specific covariance matrices. We develop a likelihood ratio test (LRT) for row and column dependence based on the observation of a single relational data matrix. We obtain a reference distribution for the LRT statistic, thereby providing an exact test for the presence of row or column correlations in a square relational data matrix. Additionally, we provide extensions of the test to accommodate common features of such data, such as undefined diagonal entries, a nonzero mean, multiple observations, and deviations from normality. Supplementary materials for this article are available online. 
Journal: Journal of the American Statistical Association Pages: 1037-1046 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2014.965777 File-URL: http://hdl.handle.net/10.1080/01621459.2014.965777 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1037-1046 Template-Type: ReDIF-Article 1.0 Author-Name: Bailey K. Fosdick Author-X-Name-First: Bailey K. Author-X-Name-Last: Fosdick Author-Name: Peter D. Hoff Author-X-Name-First: Peter D. Author-X-Name-Last: Hoff Title: Testing and Modeling Dependencies Between a Network and Nodal Attributes Abstract: Network analysis is often focused on characterizing the dependencies between network relations and node-level attributes. Potential relationships are typically explored by modeling the network as a function of the nodal attributes or by modeling the attributes as a function of the network. These methods require specification of the exact nature of the association between the network and attributes, reduce the network data to a small number of summary statistics, and are unable to provide predictions simultaneously for missing attribute and network information. Existing methods that model the attributes and network jointly also assume the data are fully observed. In this article, we introduce a unified approach to analysis that addresses these shortcomings. We use a previously developed latent variable model to obtain a low-dimensional representation of the network in terms of node-specific network factors. We introduce a novel testing procedure to determine if dependencies exist between the network factors and attributes as a surrogate for a test of dependence between the network and attributes. We also present a joint model for the network relations and attributes, for use if the hypothesis of independence is rejected, which can capture a variety of dependence patterns and be used to make inference and predictions for missing observations. Journal: Journal of the American Statistical Association Pages: 1047-1056 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2015.1008697 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1008697 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1047-1056 Template-Type: ReDIF-Article 1.0 Author-Name: Laura Azzimonti Author-X-Name-First: Laura Author-X-Name-Last: Azzimonti Author-Name: Laura M. Sangalli Author-X-Name-First: Laura M. Author-X-Name-Last: Sangalli Author-Name: Piercesare Secchi Author-X-Name-First: Piercesare Author-X-Name-Last: Secchi Author-Name: Maurizio Domanin Author-X-Name-First: Maurizio Author-X-Name-Last: Domanin Author-Name: Fabio Nobile Author-X-Name-First: Fabio Author-X-Name-Last: Nobile Title: Blood Flow Velocity Field Estimation Via Spatial Regression With PDE Penalization Abstract: We propose an innovative method for the accurate estimation of surfaces and spatial fields when prior knowledge of the phenomenon under study is available. The prior knowledge included in the model derives from physics, physiology, or mechanics of the problem at hand, and is formalized in terms of a partial differential equation governing the phenomenon behavior, as well as conditions that the phenomenon has to satisfy at the boundary of the problem domain. The proposed models exploit advanced scientific computing techniques and specifically make use of the finite element method. 
The estimators have a penalized regression form and the usual inferential tools are derived. Both the pointwise and the areal data frameworks are considered. The driving application concerns the estimation of the blood flow velocity field in a section of a carotid artery, using data provided by echo-color Doppler. This applied problem arises within a research project that aims at studying atherosclerosis pathogenesis. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1057-1071 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2014.946036 File-URL: http://hdl.handle.net/10.1080/01621459.2014.946036 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1057-1071 Template-Type: ReDIF-Article 1.0 Author-Name: C. Villa Author-X-Name-First: C. Author-X-Name-Last: Villa Author-Name: S. G. Walker Author-X-Name-First: S. G. Author-X-Name-Last: Walker Title: An Objective Approach to Prior Mass Functions for Discrete Parameter Spaces Abstract: We present a novel approach to constructing objective prior distributions for discrete parameter spaces. These types of parameter spaces are particularly problematic, as it appears that common objective procedures to design prior distributions are problem specific. We propose an objective criterion, based on loss functions, instead of trying to define objective probabilities directly. We systematically apply this criterion to a series of discrete scenarios, previously considered in the literature, and compare the priors. The proposed approach applies to any discrete parameter space, making it appealing as it does not involve different concepts according to the model. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1072-1082 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2014.946319 File-URL: http://hdl.handle.net/10.1080/01621459.2014.946319 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1072-1082 Template-Type: ReDIF-Article 1.0 Author-Name: Tianhao Wang Author-X-Name-First: Tianhao Author-X-Name-Last: Wang Author-Name: Yingcun Xia Author-X-Name-First: Yingcun Author-X-Name-Last: Xia Title: Whittle Likelihood Estimation of Nonlinear Autoregressive Models With Moving Average Residuals Abstract: Whittle likelihood estimation (WLE) has played a fundamental role in the development of both theory and computation of time series analysis. However, WLE is only applicable to models whose theoretical spectral density function (SDF) is known up to the parameters in the models. In this article, we propose a residual-based WLE, called extended WLE (XWLE), which can estimate models whose SDFs are only partially available, including many popular time series models with correlated residuals. Asymptotic properties of XWLE are established. In particular, XWLE is asymptotically equivalent to WLE in estimating linear ARMA models, and is also capable of estimating nonlinear AR models with MA residuals and even with exogenous variables. The finite-sample performance of XWLE is assessed through simulated examples and a real data analysis. 
Journal: Journal of the American Statistical Association Pages: 1083-1099 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2014.946513 File-URL: http://hdl.handle.net/10.1080/01621459.2014.946513 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1083-1099 Template-Type: ReDIF-Article 1.0 Author-Name: Graciela Boente Author-X-Name-First: Graciela Author-X-Name-Last: Boente Author-Name: Matías Salibian-Barrera Author-X-Name-First: Matías Author-X-Name-Last: Salibian-Barrera Title: S-Estimators for Functional Principal Component Analysis Abstract: Principal component analysis is a widely used technique that provides an optimal lower-dimensional approximation to multivariate or functional datasets. These approximations can be very useful in identifying potential outliers among high-dimensional or functional observations. In this article, we propose a new class of estimators for principal components based on robust scale estimators. For a fixed dimension q, we robustly estimate the q-dimensional linear space that provides the best prediction for the data, in the sense of minimizing the sum of robust scale estimators of the coordinates of the residuals. We also study an extension to the infinite-dimensional case. Our method is consistent for elliptical random vectors, and is Fisher consistent for elliptically distributed random elements on arbitrary Hilbert spaces. Numerical experiments show that our proposal is highly competitive when compared with other methods. We illustrate our approach on a real dataset, where the robust estimator discovers atypical observations that would have been missed otherwise. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1100-1111 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2014.946991 File-URL: http://hdl.handle.net/10.1080/01621459.2014.946991 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1100-1111 Template-Type: ReDIF-Article 1.0 Author-Name: Jian Zhu Author-X-Name-First: Jian Author-X-Name-Last: Zhu Author-Name: Trivellore E. Raghunathan Author-X-Name-First: Trivellore E. Author-X-Name-Last: Raghunathan Title: Convergence Properties of a Sequential Regression Multiple Imputation Algorithm Abstract: A sequential regression or chained equations imputation approach uses a Gibbs sampling-type iterative algorithm that imputes the missing values using a sequence of conditional regression models. It is a flexible approach for handling different types of variables and complex data structures. Many simulation studies have shown that the multiple imputation inferences based on this procedure have desirable repeated sampling properties. However, a theoretical weakness of this approach is that the specification of a set of conditional regression models may not be compatible with a joint distribution of the variables being imputed. Hence, the convergence properties of the iterative algorithm are not well understood. This article develops conditions for convergence and assesses the properties of inferences from both compatible and incompatible sequences of regression models. The results are established for the missing data pattern where each subject may be missing a value on at most one variable. 
The sequence of regression models, chosen by the imputer based on appropriate model diagnostics, is assumed to provide an empirically good fit for the data. The results are used to develop criteria for the choice of regression models. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1112-1124 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2014.948117 File-URL: http://hdl.handle.net/10.1080/01621459.2014.948117 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1112-1124 Template-Type: ReDIF-Article 1.0 Author-Name: Hua Yun Chen Author-X-Name-First: Hua Yun Author-X-Name-Last: Chen Author-Name: Daniel E. Rader Author-X-Name-First: Daniel E. Author-X-Name-Last: Rader Author-Name: Mingyao Li Author-X-Name-First: Mingyao Author-X-Name-Last: Li Title: Likelihood Inferences on Semiparametric Odds Ratio Model Abstract: A flexible semiparametric odds ratio model has been proposed to unify and to extend both the log-linear model and the joint normal model for data with a mix of discrete and continuous variables. The semiparametric odds ratio model is particularly useful for analyzing biased sampling designs. However, statistical inference of the model has not been systematically studied when more than one nonparametric component is involved in the model. In this article, we study the maximum semiparametric likelihood approach to estimation and inference of the semiparametric odds ratio model. We show that the maximum semiparametric likelihood estimator of the odds ratio parameter is consistent and asymptotically normally distributed. We also establish statistical inference under a misspecified semiparametric odds ratio model, which is important when handling weak identifiability in conditionally specified models under biased sampling designs. We use simulation studies to demonstrate that the proposed approaches have satisfactory finite sample performance. Finally, we illustrate the proposed approach by analyzing multiple traits in a genome-wide association study of high-density lipoprotein. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1125-1135 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2014.948544 File-URL: http://hdl.handle.net/10.1080/01621459.2014.948544 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1125-1135 Template-Type: ReDIF-Article 1.0 Author-Name: Jiming Jiang Author-X-Name-First: Jiming Author-X-Name-Last: Jiang Author-Name: Thuan Nguyen Author-X-Name-First: Thuan Author-X-Name-Last: Nguyen Author-Name: J. Sunil Rao Author-X-Name-First: J. Sunil Author-X-Name-Last: Rao Title: The E-MS Algorithm: Model Selection With Incomplete Data Abstract: We propose a procedure associated with the idea of the E-M algorithm for model selection in the presence of missing data. The idea extends the concept of parameters to include both the model and the parameters under the model, and thus allows the model to be part of the E-M iterations. We develop the procedure, known as the E-MS algorithm, under the assumption that the class of candidate models is finite. Some special cases of the procedure are considered, including E-MS with the generalized information criteria (GIC), and E-MS with the adaptive fence (AF; Jiang et al.). 
We prove numerical convergence of the E-MS algorithm, as well as consistency in model selection of the limiting model to which E-MS converges, for E-MS with GIC and E-MS with AF. We study the impact on model selection of different missing data mechanisms. Furthermore, we carry out extensive simulation studies on the finite-sample performance of the E-MS with comparisons to other procedures. The methodology is also illustrated on a real data analysis involving QTL mapping for an agricultural study on barley grains. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1136-1147 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2014.948545 File-URL: http://hdl.handle.net/10.1080/01621459.2014.948545 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1136-1147 Template-Type: ReDIF-Article 1.0 Author-Name: Deng Pan Author-X-Name-First: Deng Author-X-Name-Last: Pan Author-Name: Haijin He Author-X-Name-First: Haijin Author-X-Name-Last: He Author-Name: Xinyuan Song Author-X-Name-First: Xinyuan Author-X-Name-Last: Song Author-Name: Liuquan Sun Author-X-Name-First: Liuquan Author-X-Name-Last: Sun Title: Regression Analysis of Additive Hazards Model With Latent Variables Abstract: We propose an additive hazards model with latent variables to investigate the observed and latent risk factors of the failure time of interest. Each latent risk factor is characterized by correlated observed variables through a confirmatory factor analysis model. We develop a hybrid procedure that combines the expectation-maximization (EM) algorithm and the borrow-strength estimation approach to estimate the model parameters. We establish the consistency and asymptotic normality of the parameter estimators. Simulation studies demonstrate various appealing features of the proposed methodology, including its finite-sample performance. Our model is applied to a study concerning the risk factors of chronic kidney disease for Type 2 diabetic patients. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1148-1159 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2014.950083 File-URL: http://hdl.handle.net/10.1080/01621459.2014.950083 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1148-1159 Template-Type: ReDIF-Article 1.0 Author-Name: Yumou Qiu Author-X-Name-First: Yumou Author-X-Name-Last: Qiu Author-Name: Song Xi Chen Author-X-Name-First: Song Xi Author-X-Name-Last: Chen Title: Bandwidth Selection for High-Dimensional Covariance Matrix Estimation Abstract: The banding estimator of Bickel and Levina and its tapering version of Cai, Zhang, and Zhou are important high-dimensional covariance estimators. Both estimators require a bandwidth parameter. We propose a bandwidth selector for the banding estimator by minimizing an empirical estimate of the expected squared Frobenius norms of the estimation error matrix. The ratio consistency of the bandwidth selector is established. We provide a lower bound for the coverage probability of the underlying bandwidth being contained in an interval around the bandwidth estimate. Extensions to the bandwidth selection for the tapering estimator and threshold level selection for the thresholding covariance estimator are made. 
Numerical simulations and a case study on sonar spectrum data are conducted to demonstrate the proposed approaches. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1160-1174 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2014.950375 File-URL: http://hdl.handle.net/10.1080/01621459.2014.950375 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1160-1174 Template-Type: ReDIF-Article 1.0 Author-Name: Chun Yip Yau Author-X-Name-First: Chun Yip Author-X-Name-Last: Yau Author-Name: Chong Man Tang Author-X-Name-First: Chong Man Author-X-Name-Last: Tang Author-Name: Thomas C. M. Lee Author-X-Name-First: Thomas C. M. Author-X-Name-Last: Lee Title: Estimation of Multiple-Regime Threshold Autoregressive Models With Structural Breaks Abstract: The threshold autoregressive (TAR) model is a class of nonlinear time series models that have been widely used in many areas. Due to its nonlinear nature, one major difficulty in fitting a TAR model is the estimation of the thresholds. As a first contribution, this article develops an automatic procedure to estimate the number and values of the thresholds, as well as the corresponding AR order and parameter values in each regime. These parameter estimates are defined as the minimizers of an objective function derived from the minimum description length (MDL) principle. A genetic algorithm (GA) is constructed to efficiently solve the associated minimization problem. The second contribution of this article is the extension of this framework to piecewise TAR modeling; that is, the time series is partitioned into different segments for which each segment can be adequately modeled by a TAR model, while models from adjacent segments are different. For such piecewise TAR modeling, a procedure is developed to estimate the number and locations of the breakpoints, together with all other parameters in each segment. Desirable theoretical results are derived to lend support to the proposed methodology. Simulation experiments and an application to U.S. GNP data are used to illustrate the empirical performance of the methodology. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1175-1186 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2014.954706 File-URL: http://hdl.handle.net/10.1080/01621459.2014.954706 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1175-1186 Template-Type: ReDIF-Article 1.0 Author-Name: Hongyuan Cao Author-X-Name-First: Hongyuan Author-X-Name-Last: Cao Author-Name: Mathew M. Churpek Author-X-Name-First: Mathew M. Author-X-Name-Last: Churpek Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: Jason P. Fine Author-X-Name-First: Jason P. Author-X-Name-Last: Fine Title: Analysis of the Proportional Hazards Model With Sparse Longitudinal Covariates Abstract: Regression analysis of censored failure observations via the proportional hazards model permits time-varying covariates that are observed at death times. In practice, such longitudinal covariates are typically sparse and only measured at infrequent and irregularly spaced follow-up times. 
Full likelihood analyses of joint models for longitudinal and survival data impose stringent modeling assumptions that are difficult to verify in practice and that are complicated both inferentially and computationally. In this article, a simple kernel weighted score function is proposed with minimal assumptions. Two scenarios are considered: half kernel estimation in which observation ceases at the time of the event and full kernel estimation for data where observation may continue after the event, as with recurrent events data. It is established that these estimators are consistent and asymptotically normal. However, they converge at rates that are slower than the parametric rates that may be achieved with fully observed covariates, with the full kernel method achieving an optimal convergence rate that is superior to that of the half kernel method. Simulation results demonstrate that the large sample approximations are adequate for practical use and may yield improved performance relative to the last-value-carried-forward approach and the joint modeling method. The analysis of the data from a cardiac arrest study demonstrates the utility of the proposed methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1187-1196 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2014.957289 File-URL: http://hdl.handle.net/10.1080/01621459.2014.957289 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1187-1196 Template-Type: ReDIF-Article 1.0 Author-Name: Claudia Kirch Author-X-Name-First: Claudia Author-X-Name-Last: Kirch Author-Name: Birte Muhsal Author-X-Name-First: Birte Author-X-Name-Last: Muhsal Author-Name: Hernando Ombao Author-X-Name-First: Hernando Author-X-Name-Last: Ombao Title: Detection of Changes in Multivariate Time Series With Application to EEG Data Abstract: The primary contributions of this article are rigorously developed novel statistical methods for detecting change points in multivariate time series. We extend the class of score-type change-point statistics considered in 2007 by Hušková, Prášková, and Steinebach to the vector autoregressive (VAR) case and the epidemic change alternative. Our proposed procedures do not require the observed time series to actually follow the VAR model. Instead, following the strategy implicitly employed by practitioners, our approach takes model misspecification into account so that our detection procedure uses the model background merely for feature extraction. We derive the asymptotic distributions of our test statistics and show that our procedure has asymptotic power of 1. The proposed test statistics require the estimation of the inverse of the long-run covariance matrix which is particularly difficult in higher-dimensional settings (i.e., where the dimension of the time series and the dimension of the parameter vector are both large). Thus we robustify the proposed test statistics and investigate their finite sample properties via extensive numerical experiments. Finally, we apply our procedure to electroencephalograms and demonstrate its potential impact in identifying change points in complex brain processes during a cognitive motor task. 
Journal: Journal of the American Statistical Association Pages: 1197-1216 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2014.957545 File-URL: http://hdl.handle.net/10.1080/01621459.2014.957545 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1197-1216 Template-Type: ReDIF-Article 1.0 Author-Name: David Azriel Author-X-Name-First: David Author-X-Name-Last: Azriel Author-Name: Armin Schwartzman Author-X-Name-First: Armin Author-X-Name-Last: Schwartzman Title: The Empirical Distribution of a Large Number of Correlated Normal Variables Abstract: Motivated by the advent of high-dimensional, highly correlated data, this work studies the limit behavior of the empirical cumulative distribution function (ecdf) of standard normal random variables under arbitrary correlation. First, we provide a necessary and sufficient condition for convergence of the ecdf to the standard normal distribution. Next, under general correlation, we show that the ecdf limit is a random, possibly infinite, mixture of normal distribution functions that depends on a number of latent variables and can serve as an asymptotic approximation to the ecdf in high dimensions. We provide conditions under which the dimension of the ecdf limit, defined as the smallest number of effective latent variables, is finite. Estimates of the latent variables are provided and their consistency proved. We demonstrate these methods in a real high-dimensional data example from brain imaging where it is shown that, while the study exhibits apparently strongly significant results, they can be entirely explained by correlation, as captured by the asymptotic approximation developed here. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1217-1228 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2014.958156 File-URL: http://hdl.handle.net/10.1080/01621459.2014.958156 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1217-1228 Template-Type: ReDIF-Article 1.0 Author-Name: Xanthi Pedeli Author-X-Name-First: Xanthi Author-X-Name-Last: Pedeli Author-Name: Anthony C. Davison Author-X-Name-First: Anthony C. Author-X-Name-Last: Davison Author-Name: Konstantinos Fokianos Author-X-Name-First: Konstantinos Author-X-Name-Last: Fokianos Title: Likelihood Estimation for the INAR(p) Model by Saddlepoint Approximation Abstract: Saddlepoint techniques have been used successfully in many applications, owing to the high accuracy with which they can approximate intractable densities and tail probabilities. This article concerns their use for the estimation of high-order integer-valued autoregressive, INAR(p), processes. Conditional least squares estimation and maximum likelihood estimation have been proposed for INAR(p) models, but the first is inefficient for estimating parametric models, and the second becomes difficult to implement as the order p increases. We propose a simple saddlepoint approximation to the log-likelihood that performs well even in the tails of the distribution and with complicated INAR models. We consider Poisson and negative binomial innovations, and show empirically that the estimator that maximizes the saddlepoint approximation behaves very similarly to the maximum likelihood estimator in realistic settings. 
The approach is applied to data on meningococcal disease counts. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1229-1238 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2014.983230 File-URL: http://hdl.handle.net/10.1080/01621459.2014.983230 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1229-1238 Template-Type: ReDIF-Article 1.0 Author-Name: Lo-Bin Chang Author-X-Name-First: Lo-Bin Author-X-Name-Last: Chang Author-Name: Donald Geman Author-X-Name-First: Donald Author-X-Name-Last: Geman Title: Tracking Cross-Validated Estimates of Prediction Error as Studies Accumulate Abstract: In recent years, "reproducibility" has emerged as a key factor in evaluating applications of statistics to the biomedical sciences, for example, learning predictors of disease phenotypes from high-throughput "omics" data. In particular, "validation" is undermined when error rates on newly acquired data are sharply higher than those originally reported. More precisely, when data are collected from m "studies" representing possibly different subphenotypes or, more generally, different mixtures of subphenotypes, the error rates in cross-study validation (CSV) are observed to be larger than those obtained in ordinary randomized cross-validation (RCV), although the "gap" seems to close as m increases. Whereas these findings are hardly surprising for a heterogeneous underlying population, this discrepancy is then seen as a barrier to translational research. We provide a statistical formulation in the large-sample limit: studies themselves are modeled as components of a mixture and all error rates are optimal (Bayes) for a two-class problem. Our results cohere with the trends observed in practice and suggest what is likely to be observed with large samples and consistent density estimators, namely, that the CSV error rate exceeds the RCV error rates for any m, the latter (appropriately averaged) increases with m, and both converge to the optimal rate for the whole population. Journal: Journal of the American Statistical Association Pages: 1239-1247 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2014.1002926 File-URL: http://hdl.handle.net/10.1080/01621459.2014.1002926 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1239-1247 Template-Type: ReDIF-Article 1.0 Author-Name: Faming Liang Author-X-Name-First: Faming Author-X-Name-Last: Liang Author-Name: Qifan Song Author-X-Name-First: Qifan Author-X-Name-Last: Song Author-Name: Peihua Qiu Author-X-Name-First: Peihua Author-X-Name-Last: Qiu Title: An Equivalent Measure of Partial Correlation Coefficients for High-Dimensional Gaussian Graphical Models Abstract: Gaussian graphical models (GGMs) are frequently used to explore networks, such as gene regulatory networks, among a set of variables. Under the classical theory of GGMs, the construction of Gaussian graphical networks amounts to finding the pairs of variables with nonzero partial correlation coefficients. However, this is infeasible for high-dimensional problems for which the number of variables is larger than the sample size. In this article, we propose a new measure of partial correlation coefficient, which is evaluated with a reduced conditional set and thus feasible for high-dimensional problems. 
Under the Markov property and adjacency faithfulness conditions, the new measure of partial correlation coefficient is equivalent to the true partial correlation coefficient in construction of Gaussian graphical networks. Based on the new measure of partial correlation coefficient, we propose a multiple hypothesis test-based method for the construction of Gaussian graphical networks. Furthermore, we establish the consistency of the proposed method under mild conditions. The proposed method outperforms the existing methods, such as the PC, graphical Lasso, nodewise regression, and qp-average methods, especially for the problems for which a large number of indirect associations are present. The proposed method has a computational complexity of nearly O(p^2), and is flexible in data integration, network comparison, and covariate adjustment. Journal: Journal of the American Statistical Association Pages: 1248-1265 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2015.1012391 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1012391 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1248-1265 Template-Type: ReDIF-Article 1.0 Author-Name: Kehui Chen Author-X-Name-First: Kehui Author-X-Name-Last: Chen Author-Name: Jing Lei Author-X-Name-First: Jing Author-X-Name-Last: Lei Title: Localized Functional Principal Component Analysis Abstract: We propose localized functional principal component analysis (LFPCA), looking for orthogonal basis functions with localized support regions that explain most of the variability of a random process. The LFPCA is formulated as a convex optimization problem through a novel deflated Fantope localization method and is implemented through an efficient algorithm to obtain the global optimum. We prove that the proposed LFPCA converges to the original functional principal component analysis (FPCA) when the tuning parameters are chosen appropriately. Simulation shows that the proposed LFPCA with tuning parameters chosen by cross-validation can almost perfectly recover the true eigenfunctions and significantly improve the estimation accuracy when the eigenfunctions are truly supported on some subdomains. In the scenario that the original eigenfunctions are not localized, the proposed LFPCA also serves as a nice tool in finding orthogonal basis functions that balance between interpretability and the capability of explaining variability of the data. The analysis of country mortality data reveals interesting features that cannot be found by standard FPCA methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1266-1275 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2015.1016225 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1016225 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1266-1275 Template-Type: ReDIF-Article 1.0 Author-Name: Jan De Neve Author-X-Name-First: Jan Author-X-Name-Last: De Neve Author-Name: Olivier Thas Author-X-Name-First: Olivier Author-X-Name-Last: Thas Title: A Regression Framework for Rank Tests Based on the Probabilistic Index Model Abstract: We demonstrate how many classical rank tests, such as the Wilcoxon-Mann-Whitney, Kruskal-Wallis, and Friedman test, can be embedded in a statistical modeling framework and how the method can be used to construct new rank tests. In addition to hypothesis testing, the method allows for estimating effect sizes with an informative interpretation, resulting in a better understanding of the data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1276-1283 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2015.1016226 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1016226 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1276-1283 Template-Type: ReDIF-Article 1.0 Author-Name: Tucker McElroy Author-X-Name-First: Tucker Author-X-Name-Last: McElroy Author-Name: Brian Monsell Author-X-Name-First: Brian Author-X-Name-Last: Monsell Title: Model Estimation, Prediction, and Signal Extraction for Nonstationary Stock and Flow Time Series Observed at Mixed Frequencies Abstract: An important practical problem for statistical agencies and central banks that publish economic data is the seasonal adjustment of mixed frequency stock and flow time series. This may arise in practice due to changes in funding of a particular survey. Mathematically, the problem can be reduced to the need to compute imputations, forecasts, and backcasts from a given model of the highest available frequency data. The nonstationarity of the economic time series coupled with the alteration of sampling frequency makes the problem of model estimation and imputation challenging. For flow data the analysis cannot be recast as a missing value problem, so that time series imputation methods are ineffective. We provide explicit formulas and algorithms that allow one to compute the log Gaussian likelihood of the mixed sample, as well as any imputations and forecasts. Formulas for the relevant mean squared error are also derived. We evaluate the methodology through simulations, and illustrate the techniques on some economic time series. Journal: Journal of the American Statistical Association Pages: 1284-1303 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2014.978452 File-URL: http://hdl.handle.net/10.1080/01621459.2014.978452 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1284-1303 Template-Type: ReDIF-Article 1.0 Author-Name: Graeme Blair Author-X-Name-First: Graeme Author-X-Name-Last: Blair Author-Name: Kosuke Imai Author-X-Name-First: Kosuke Author-X-Name-Last: Imai Author-Name: Yang-Yang Zhou Author-X-Name-First: Yang-Yang Author-X-Name-Last: Zhou Title: Design and Analysis of the Randomized Response Technique Abstract: About a half century ago, in 1965, Warner proposed the randomized response method as a survey technique to reduce potential bias due to nonresponse and social desirability when asking questions about sensitive behaviors and beliefs. 
This method asks respondents to use a randomization device, such as a coin flip, whose outcome is unobserved by the interviewer. By introducing random noise, the method conceals individual responses and protects respondent privacy. While numerous methodological advances have been made, we find surprisingly few applications of this promising survey technique. In this article, we address this gap by (1) reviewing standard designs available to applied researchers, (2) developing various multivariate regression techniques for substantive analyses, (3) proposing power analyses to help improve research designs, (4) presenting new robust designs that are based on less stringent assumptions than those of the standard designs, and (5) making all described methods available through open-source software. We illustrate some of these methods with an original survey about militant groups in Nigeria. Journal: Journal of the American Statistical Association Pages: 1304-1319 Issue: 511 Volume: 110 Year: 2015 Month: 9 X-DOI: 10.1080/01621459.2015.1050028 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1050028 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1304-1319 Template-Type: ReDIF-Article 1.0 Author-Name: David Morganstein Author-X-Name-First: David Author-X-Name-Last: Morganstein Title: Statistics: Making Better Decisions Journal: Journal of the American Statistical Association Pages: 1325-1330 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1106790 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1106790 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1325-1330 Template-Type: ReDIF-Article 1.0 Author-Name: Joshua D. Angrist Author-X-Name-First: Joshua D. Author-X-Name-Last: Angrist Author-Name: Miikka Rokkanen Author-X-Name-First: Miikka Author-X-Name-Last: Rokkanen Title: Wanna Get Away? Regression Discontinuity Estimation of Exam School Effects Away From the Cutoff Abstract: In regression discontinuity (RD) studies exploiting an award or admissions cutoff, causal effects are nonparametrically identified for those near the cutoff. The effect of treatment on inframarginal applicants is also of interest, but identification of such effects requires stronger assumptions than those required for identification at the cutoff. This article discusses RD identification and estimation away from the cutoff. Our identification strategy exploits the availability of dependent variable predictors other than the running variable. Conditional on these predictors, the running variable is assumed to be ignorable. This identification strategy is used to study effects of Boston exam schools for inframarginal applicants. Identification based on the conditional independence assumptions imposed in our framework yields reasonably precise and surprisingly robust estimates of the effects of exam school attendance on inframarginal applicants. These estimates suggest that the causal effects of exam school attendance for 9th grade applicants with running variable values well away from admissions cutoffs differ little from those for applicants with values that put them on the margin of acceptance. An extension to fuzzy designs is shown to identify causal effects for compliers away from the cutoff. Supplementary materials for this article are available online. 
Journal: Journal of the American Statistical Association Pages: 1331-1344 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1012259 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1012259 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1331-1344 Template-Type: ReDIF-Article 1.0 Author-Name: Michael G. Hudgens Author-X-Name-First: Michael G. Author-X-Name-Last: Hudgens Title: Comment Journal: Journal of the American Statistical Association Pages: 1345-1347 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1033058 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1033058 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1345-1347 Template-Type: ReDIF-Article 1.0 Author-Name: Thomas Lemieux Author-X-Name-First: Thomas Author-X-Name-Last: Lemieux Title: Comment Journal: Journal of the American Statistical Association Pages: 1347-1348 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1054490 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1054490 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1347-1348 Template-Type: ReDIF-Article 1.0 Author-Name: Joshua D. Angrist Author-X-Name-First: Joshua D. Author-X-Name-Last: Angrist Author-Name: Miikka Rokkanen Author-X-Name-First: Miikka Author-X-Name-Last: Rokkanen Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 1348-1349 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1106189 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1106189 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1348-1349 Template-Type: ReDIF-Article 1.0 Author-Name: Bo Jiang Author-X-Name-First: Bo Author-X-Name-Last: Jiang Author-Name: Jun S. Liu Author-X-Name-First: Jun S. Author-X-Name-Last: Liu Title: Bayesian Partition Models for Identifying Expression Quantitative Trait Loci Abstract: Expression quantitative trait loci (eQTLs) are genomic locations associated with changes of expression levels of certain genes. By assaying gene expressions and genetic variations simultaneously on a genome-wide scale, scientists wish to discover genomic loci responsible for expression variations of a set of genes. The task can be viewed as a multivariate regression problem with variable selection on both responses (gene expression) and covariates (genetic variations), including also multi-way interactions among covariates. Instead of learning a predictive model of quantitative trait given combinations of genetic markers, we adopt an inverse modeling perspective to model the distribution of genetic markers conditional on gene expression traits. A particular strength of our method is its ability to detect interactive effects of genetic variations with high power even when their marginal effects are weak, addressing a key weakness of many existing eQTL mapping methods. Furthermore, we introduce a hierarchical model to capture the dependence structure among correlated genes. Through simulation studies and a real data example in yeast, we demonstrate how our Bayesian hierarchical partition model achieves a significantly improved power in detecting eQTLs compared to existing methods. 
Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1350-1361 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1049746 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1049746 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1350-1361 Template-Type: ReDIF-Article 1.0 Author-Name: Liangliang Wang Author-X-Name-First: Liangliang Author-X-Name-Last: Wang Author-Name: Alexandre Bouchard-Côté Author-X-Name-First: Alexandre Author-X-Name-Last: Bouchard-Côté Author-Name: Arnaud Doucet Author-X-Name-First: Arnaud Author-X-Name-Last: Doucet Title: Bayesian Phylogenetic Inference Using a Combinatorial Sequential Monte Carlo Method Abstract: The application of Bayesian methods to large-scale phylogenetics problems is increasingly limited by computational issues, motivating the development of methods that can complement existing Markov chain Monte Carlo (MCMC) schemes. Sequential Monte Carlo (SMC) methods are approximate inference algorithms that have become very popular for time series models. Such methods have been recently developed to address phylogenetic inference problems but currently available techniques are only applicable to a restricted class of phylogenetic tree models compared to MCMC. In this article, we propose an original combinatorial SMC (CSMC) method to approximate posterior phylogenetic tree distributions, which is applicable to a general class of models and can be easily combined with MCMC to infer evolutionary parameters. Our method only relies on the existence of a flexible partially ordered set structure and is more generally applicable to sampling problems on combinatorial spaces. We demonstrate that the proposed CSMC algorithm provides consistent estimates under weak assumptions, is computationally fast, and is additionally easily parallelizable. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1362-1374 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1054487 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1054487 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1362-1374 Template-Type: ReDIF-Article 1.0 Author-Name: Jian Zhang Author-X-Name-First: Jian Author-X-Name-Last: Zhang Author-Name: Li Su Author-X-Name-First: Li Author-X-Name-Last: Su Title: Temporal Autocorrelation-Based Beamforming With MEG Neuroimaging Data Abstract: Characterizing the brain source activity using magnetoencephalography (MEG) requires solving an ill-posed inverse problem. Most source reconstruction procedures are performed in terms of power comparison. However, in the presence of voxel-specific noise, the direct power analysis can be misleading due to power distortion, as suggested by our multiple-trial MEG study on a face-perception experiment. To tackle the issue, we propose a temporal autocorrelation-based method for the above analysis. The new method improves the face-perception analysis and identifies several differences between neuronal responses to face and scrambled-face stimuli. Through simulated and real data analyses, we demonstrate that, compared to existing methods, the new proposal can be more robust to voxel-specific noise without compromising its accuracy in source localization. 
We further establish consistency in estimating the proposed index when the number of sensors and the number of time instants are sufficiently large. In particular, we show that the proposed procedure focuses more sharply on true sources than its predecessors in terms of the peak segregation coefficient. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1375-1388 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1054488 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1054488 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1375-1388 Template-Type: ReDIF-Article 1.0 Author-Name: Paul R. Rosenbaum Author-X-Name-First: Paul R. Author-X-Name-Last: Rosenbaum Title: Some Counterclaims Undermine Themselves in Observational Studies Abstract: Claims based on observational studies that a treatment has certain effects are often met with counterclaims asserting that the treatment is without effect, that associations are produced by biased treatment assignment. Some counterclaims undermine themselves in the following specific sense: presuming the counterclaim to be true may strengthen the support that the original data provide for the original claim, so that the counterclaim fails in its role as a critique of the original claim. In mathematics, a proof by contradiction supposes a proposition to be true en route to proving that the proposition is false. Analogously, the supposition that a particular counterclaim is true may justify an otherwise unjustified statistical analysis, and this added analysis may interpret the original data as providing even stronger support for the original claim. More precisely, the original study is sensitive to unmeasured biases of a particular magnitude, but an analysis that supposes the counterclaim to be true may be insensitive to much larger unmeasured biases. The issues are illustrated using data from the U.S. Fatal Accident Reporting System. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1389-1398 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1054489 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1054489 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1389-1398 Template-Type: ReDIF-Article 1.0 Author-Name: G. O. Mohler Author-X-Name-First: G. O. Author-X-Name-Last: Mohler Author-Name: M. B. Short Author-X-Name-First: M. B. Author-X-Name-Last: Short Author-Name: Sean Malinowski Author-X-Name-First: Sean Author-X-Name-Last: Malinowski Author-Name: Mark Johnson Author-X-Name-First: Mark Author-X-Name-Last: Johnson Author-Name: G. E. Tita Author-X-Name-First: G. E. Author-X-Name-Last: Tita Author-Name: Andrea L. Bertozzi Author-X-Name-First: Andrea L. Author-X-Name-Last: Bertozzi Author-Name: P. J. Brantingham Author-X-Name-First: P. J. Author-X-Name-Last: Brantingham Title: Randomized Controlled Field Trials of Predictive Policing Abstract: The concentration of police resources in stable crime hotspots has proven effective in reducing crime, but the extent to which police can disrupt dynamically changing crime hotspots is unknown. Police must be able to anticipate the future location of dynamic hotspots to disrupt them. 
Here we report results of two randomized controlled trials of near real-time epidemic-type aftershock sequence (ETAS) crime forecasting, one trial within three divisions of the Los Angeles Police Department and the other trial within two divisions of the Kent Police Department (United Kingdom). We investigate the extent to which (i) ETAS models of short-term crime risk outperform existing best practice of hotspot maps produced by dedicated crime analysts, (ii) police officers in the field can dynamically patrol predicted hotspots given limited resources, and (iii) crime can be reduced by predictive policing algorithms under realistic law enforcement resource constraints. While previous hotspot policing experiments fix treatment and control hotspots throughout the experimental period, we use a novel experimental design that allows treatment and control hotspots to change dynamically over the course of the experiment. Our results show that ETAS models predict 1.4--2.2 times as much crime as a dedicated crime analyst using existing criminal intelligence and hotspot mapping practice. Police patrols using ETAS forecasts led to an average 7.4% reduction in crime volume as a function of patrol time, whereas patrols based upon analyst predictions showed no significant effect. Dynamic police patrol in response to ETAS crime forecasts can disrupt opportunities for crime and lead to real crime reductions. Journal: Journal of the American Statistical Association Pages: 1399-1411 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1077710 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1077710 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1399-1411 Template-Type: ReDIF-Article 1.0 Author-Name: Kari Lock Morgan Author-X-Name-First: Kari Lock Author-X-Name-Last: Morgan Author-Name: Donald B. Rubin Author-X-Name-First: Donald B. Author-X-Name-Last: Rubin Title: Rerandomization to Balance Tiers of Covariates Abstract: When conducting a randomized experiment, if an allocation yields treatment groups that differ meaningfully with respect to relevant covariates, groups should be rerandomized. The process involves specifying an explicit criterion for whether an allocation is acceptable, based on a measure of covariate balance, and rerandomizing units until an acceptable allocation is obtained. Here, we illustrate how rerandomization could have improved the design of an already conducted randomized experiment on vocabulary and mathematics training programs, then provide a rerandomization procedure for covariates that vary in importance, and finally offer other extensions for rerandomization, including methods addressing computational efficiency. When covariates vary in a priori importance, better balance should be required for more important covariates. Rerandomization based on Mahalanobis distance preserves the joint distribution of covariates, but balances all covariates equally. Here, we propose rerandomizing based on Mahalanobis distance within tiers of covariate importance. Because balancing covariates in one tier will in general also partially balance covariates in other tiers, for each subsequent tier we explicitly balance only the components orthogonal to covariates in more important tiers. 
Journal: Journal of the American Statistical Association Pages: 1412-1421 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1079528 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1079528 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1412-1421 Template-Type: ReDIF-Article 1.0 Author-Name: Ian W. McKeague Author-X-Name-First: Ian W. Author-X-Name-Last: McKeague Author-Name: Min Qian Author-X-Name-First: Min Author-X-Name-Last: Qian Title: An Adaptive Resampling Test for Detecting the Presence of Significant Predictors Abstract: This article investigates marginal screening for detecting the presence of significant predictors in high-dimensional regression. Screening large numbers of predictors is a challenging problem due to the nonstandard limiting behavior of post-model-selected estimators. There is a common misconception that the oracle property for such estimators is a panacea, but the oracle property holds only away from the null hypothesis of interest in marginal screening. To address this difficulty, we propose an adaptive resampling test (ART). Our approach provides an alternative to the popular (yet conservative) Bonferroni method of controlling family-wise error rates. ART is adaptive in the sense that thresholding is used to decide whether the centered percentile bootstrap applies, and otherwise adapts to the nonstandard asymptotics in the tightest way possible. The performance of the approach is evaluated in a simulation study, and the method is applied to gene expression data and HIV drug resistance data. Journal: Journal of the American Statistical Association Pages: 1422-1433 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1095099 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1095099 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1422-1433 Template-Type: ReDIF-Article 1.0 Author-Name: A. Chatterjee Author-X-Name-First: A. Author-X-Name-Last: Chatterjee Author-Name: S. N. Lahiri Author-X-Name-First: S. N. Author-X-Name-Last: Lahiri Title: Comment Journal: Journal of the American Statistical Association Pages: 1434-1438 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1102143 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1102143 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1434-1438 Template-Type: ReDIF-Article 1.0 Author-Name: Rajen D. Shah Author-X-Name-First: Rajen D. Author-X-Name-Last: Shah Author-Name: Richard J. Samworth Author-X-Name-First: Richard J. Author-X-Name-Last: Samworth Title: Comment Journal: Journal of the American Statistical Association Pages: 1439-1442 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1102142 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1102142 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1439-1442 Template-Type: ReDIF-Article 1.0 Author-Name: Emre Barut Author-X-Name-First: Emre Author-X-Name-Last: Barut Author-Name: Huixia Judy Wang Author-X-Name-First: Huixia Judy Author-X-Name-Last: Wang Title: Comment Journal: Journal of the American Statistical Association Pages: 1442-1445 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1100619 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1100619 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1442-1445 Template-Type: ReDIF-Article 1.0 Author-Name: Lawrence D. Brown Author-X-Name-First: Lawrence D. Author-X-Name-Last: Brown Author-Name: Daniel McCarthy Author-X-Name-First: Daniel Author-X-Name-Last: McCarthy Title: Comment Journal: Journal of the American Statistical Association Pages: 1446-1449 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1099536 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1099536 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1446-1449 Template-Type: ReDIF-Article 1.0 Author-Name: Alexandre Belloni Author-X-Name-First: Alexandre Author-X-Name-Last: Belloni Author-Name: Victor Chernozhukov Author-X-Name-First: Victor Author-X-Name-Last: Chernozhukov Title: Comment Journal: Journal of the American Statistical Association Pages: 1449-1451 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1098545 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1098545 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1449-1451 Template-Type: ReDIF-Article 1.0 Author-Name: Yichi Zhang Author-X-Name-First: Yichi Author-X-Name-Last: Zhang Author-Name: Eric B. Laber Author-X-Name-First: Eric B. Author-X-Name-Last: Laber Title: Comment Journal: Journal of the American Statistical Association Pages: 1451-1454 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1106403 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1106403 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1451-1454 Template-Type: ReDIF-Article 1.0 Author-Name: Sai Li Author-X-Name-First: Sai Author-X-Name-Last: Li Author-Name: Ritwik Mitra Author-X-Name-First: Ritwik Author-X-Name-Last: Mitra Author-Name: Cun-Hui Zhang Author-X-Name-First: Cun-Hui Author-X-Name-Last: Zhang Title: Comment Journal: Journal of the American Statistical Association Pages: 1455-1456 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1106404 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1106404 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1455-1456 Template-Type: ReDIF-Article 1.0 Author-Name: Hannes Leeb Author-X-Name-First: Hannes Author-X-Name-Last: Leeb Title: Comment Journal: Journal of the American Statistical Association Pages: 1457-1459 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1109516 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1109516 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1457-1459 Template-Type: ReDIF-Article 1.0 Author-Name: Ian W. McKeague Author-X-Name-First: Ian W. Author-X-Name-Last: McKeague Author-Name: Min Qian Author-X-Name-First: Min Author-X-Name-Last: Qian Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 1459-1462 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1107431 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1107431 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1459-1462 Template-Type: ReDIF-Article 1.0 Author-Name: Stephen S. M. Lee Author-X-Name-First: Stephen S. M. Author-X-Name-Last: Lee Author-Name: Mehdi Soleymani Author-X-Name-First: Mehdi Author-X-Name-Last: Soleymani Title: A Simple Formula for Mixing Estimators With Different Convergence Rates Abstract: Suppose that two estimators, $\hat{\theta}_1$ and $\hat{\theta}_2$, are available for estimating an unknown parameter $\theta$, and are known to have convergence rates $n^{1/2}$ and $r_n = o(n^{1/2})$, respectively, based on a sample of size n. Typically, the more efficient estimator $\hat{\theta}_1$ is less robust than $\hat{\theta}_2$, and a definitive choice cannot easily be made between them under practical circumstances. We propose a simple mixture estimator, in the form of a linear combination of $\hat{\theta}_1$ and $\hat{\theta}_2$, which successfully reaps the benefits of both estimators. We prove that the mixture estimator possesses a kind of oracle property, so that it captures the fast $n^{1/2}$ convergence rate of $\hat{\theta}_1$ when conditions are favorable, and is at least $r_n$-consistent otherwise. Applications of the mixture estimator are illustrated with examples drawn from different problem settings including orthogonal function regression, local polynomial regression, density derivative estimation, and bootstrap inferences for possibly dependent data. Journal: Journal of the American Statistical Association Pages: 1463-1478 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.960966 File-URL: http://hdl.handle.net/10.1080/01621459.2014.960966 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1463-1478 Template-Type: ReDIF-Article 1.0 Author-Name: Anirban Bhattacharya Author-X-Name-First: Anirban Author-X-Name-Last: Bhattacharya Author-Name: Debdeep Pati Author-X-Name-First: Debdeep Author-X-Name-Last: Pati Author-Name: Natesh S. Pillai Author-X-Name-First: Natesh S. Author-X-Name-Last: Pillai Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: Dirichlet--Laplace Priors for Optimal Shrinkage Abstract: Penalized regression methods, such as L1 regularization, are routinely used in high-dimensional applications, and there is a rich literature on optimality properties under sparsity assumptions. In the Bayesian paradigm, sparsity is routinely induced through two-component mixture priors having a probability mass at zero, but such priors encounter daunting computational problems in high dimensions. This has motivated continuous shrinkage priors, which can be expressed as global-local scale mixtures of Gaussians, facilitating computation. In contrast to the frequentist literature, little is known about the properties of such priors and the convergence and concentration of the corresponding posterior distribution. 
In this article, we propose a new class of Dirichlet--Laplace priors, which possess optimal posterior concentration and lead to efficient posterior computation. Finite sample performance of Dirichlet--Laplace priors relative to alternatives is assessed in simulated and real data examples. Journal: Journal of the American Statistical Association Pages: 1479-1490 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.960967 File-URL: http://hdl.handle.net/10.1080/01621459.2014.960967 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1479-1490 Template-Type: ReDIF-Article 1.0 Author-Name: Weizhen Wang Author-X-Name-First: Weizhen Author-X-Name-Last: Wang Title: Exact Optimal Confidence Intervals for Hypergeometric Parameters Abstract: For a hypergeometric distribution, where N is the population size, M is the number of population units with some attribute, and n is the given sample size, there are two parametric cases: (i) N is unknown and M is given; (ii) M is unknown and N is given. For each case, we first show that the minimum coverage probability of commonly used approximate intervals is much smaller than the nominal level for any n; we then provide exact smallest lower and upper one-sided confidence intervals and an exact admissible two-sided confidence interval, a complete set of solutions, for each parameter. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1491-1499 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.966191 File-URL: http://hdl.handle.net/10.1080/01621459.2014.966191 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1491-1499 Template-Type: ReDIF-Article 1.0 Author-Name: Rajarshi Guhaniyogi Author-X-Name-First: Rajarshi Author-X-Name-Last: Guhaniyogi Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: Bayesian Compressed Regression Abstract: As an alternative to variable selection or shrinkage in high-dimensional regression, we propose to randomly compress the predictors prior to analysis. This dramatically reduces storage and computational bottlenecks, performing well when the predictors can be projected to a low-dimensional linear subspace with minimal loss of information about the response. As opposed to existing Bayesian dimensionality reduction approaches, the exact posterior distribution conditional on the compressed data is available analytically, speeding up computation by many orders of magnitude while also bypassing robustness issues due to convergence and mixing problems with MCMC. Model averaging is used to reduce sensitivity to the random projection matrix, while accommodating uncertainty in the subspace dimension. Strong theoretical support is provided for the approach by showing near parametric convergence rates for the predictive density in the large p small n asymptotic paradigm. Practical performance relative to competitors is illustrated in simulations and real data applications. Journal: Journal of the American Statistical Association Pages: 1500-1514 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.969425 File-URL: http://hdl.handle.net/10.1080/01621459.2014.969425 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1500-1514 Template-Type: ReDIF-Article 1.0 Author-Name: Zifang Guo Author-X-Name-First: Zifang Author-X-Name-Last: Guo Author-Name: Lexin Li Author-X-Name-First: Lexin Author-X-Name-Last: Li Author-Name: Wenbin Lu Author-X-Name-First: Wenbin Author-X-Name-Last: Lu Author-Name: Bing Li Author-X-Name-First: Bing Author-X-Name-Last: Li Title: Groupwise Dimension Reduction via Envelope Method Abstract: The family of sufficient dimension reduction (SDR) methods that produce informative combinations of predictors, or indices, is particularly useful for high-dimensional regression analysis. In many such analyses, it is increasingly common for a priori subject knowledge of the predictors to be available; for example, the predictors may belong to different groups. While many recent SDR proposals have greatly expanded the scope of the methods’ applicability, how to effectively incorporate the prior predictor structure information remains a challenge. In this article, we aim at dimension reduction that recovers full regression information while preserving the predictor group structure. Built upon a new concept of the direct sum envelope, we introduce a systematic way to incorporate the group information into most existing SDR estimators. As a result, the reduction outcomes are much easier to interpret. Moreover, the envelope method provides a principled way to build a variety of prior structures into dimension reduction analysis. Both simulations and real data analysis demonstrate the competitive numerical performance of the new method. Journal: Journal of the American Statistical Association Pages: 1515-1527 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.970687 File-URL: http://hdl.handle.net/10.1080/01621459.2014.970687 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1515-1527 Template-Type: ReDIF-Article 1.0 Author-Name: Antonio F. Galvao Author-X-Name-First: Antonio F. Author-X-Name-Last: Galvao Author-Name: Liang Wang Author-X-Name-First: Liang Author-X-Name-Last: Wang Title: Uniformly Semiparametric Efficient Estimation of Treatment Effects With a Continuous Treatment Abstract: This article studies identification, estimation, and inference of general unconditional treatment effects models with continuous treatment under the ignorability assumption. We show identification of the parameters of interest, the dose--response functions, under the assumption that selection to treatment is based on observables. We propose a semiparametric two-step estimator, and consider estimation of the dose--response functions through moment restriction models with generalized residual functions that are possibly nonsmooth. This general formulation includes average and quantile treatment effects as special cases. The asymptotic properties of the estimator are derived, namely, uniform consistency, weak convergence, and semiparametric efficiency. We also develop statistical inference procedures and establish the validity of a bootstrap approach to implement these methods in practice. Monte Carlo simulations show that the proposed methods have good finite sample properties. Finally, we apply the proposed methods to estimate the unconditional average and quantile effects of mothers’ weight gain and age on birthweight. Supplementary materials for this article are available online. 
Journal: Journal of the American Statistical Association Pages: 1528-1542 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.978005 File-URL: http://hdl.handle.net/10.1080/01621459.2014.978005 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1528-1542 Template-Type: ReDIF-Article 1.0 Author-Name: J. Marcus Jobe Author-X-Name-First: J. Marcus Author-X-Name-Last: Jobe Author-Name: Michael Pokojovy Author-X-Name-First: Michael Author-X-Name-Last: Pokojovy Title: A Cluster-Based Outlier Detection Scheme for Multivariate Data Abstract: Detection power of the squared Mahalanobis distance statistic is significantly reduced when several outliers exist within a multivariate dataset of interest. To overcome this masking effect, we propose a computer-intensive cluster-based approach that incorporates a reweighted version of Rousseeuw’s minimum covariance determinant method with a multi-step cluster-based algorithm that initially filters out potential masking points. Simulation studies show that our new method detects outliers better than the most robust existing procedures. Additional real data comparisons are given. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1543-1551 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.983231 File-URL: http://hdl.handle.net/10.1080/01621459.2014.983231 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1543-1551 Template-Type: ReDIF-Article 1.0 Author-Name: Ryan Martin Author-X-Name-First: Ryan Author-X-Name-Last: Martin Title: Plausibility Functions and Exact Frequentist Inference Abstract: In the frequentist program, inferential methods with exact control on error rates are a primary focus. The standard approach, however, is to rely on asymptotic approximations, which may not be suitable. This article presents a general framework for the construction of exact frequentist procedures based on plausibility functions. It is shown that the plausibility function-based tests and confidence regions have the desired frequentist properties in finite samples—no large-sample justification needed. An extension of the proposed method is also given for problems involving nuisance parameters. Examples demonstrate that the plausibility function-based method is both exact and efficient in a wide variety of problems. Journal: Journal of the American Statistical Association Pages: 1552-1561 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.983232 File-URL: http://hdl.handle.net/10.1080/01621459.2014.983232 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1552-1561 Template-Type: ReDIF-Article 1.0 Author-Name: Jing Zhou Author-X-Name-First: Jing Author-X-Name-Last: Zhou Author-Name: Anirban Bhattacharya Author-X-Name-First: Anirban Author-X-Name-Last: Bhattacharya Author-Name: Amy H. Herring Author-X-Name-First: Amy H. Author-X-Name-Last: Herring Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: Bayesian Factorizations of Big Sparse Tensors Abstract: It has become routine to collect data that are structured as multiway arrays (tensors). 
There is an enormous literature on low rank and sparse matrix factorizations, but limited consideration of extensions to the tensor case in statistics. The most common low rank tensor factorization relies on parallel factor analysis (PARAFAC), which expresses a rank k tensor as a sum of rank one tensors. In contingency table applications in which the sample size is vastly smaller than the number of cells in the table, the low rank assumption is not sufficient and PARAFAC has poor performance. We induce an additional layer of dimension reduction by allowing the effective rank to vary across dimensions of the table. Taking a Bayesian approach, we place priors on terms in the factorization and develop an efficient Gibbs sampler for posterior computation. Theory is provided showing posterior concentration rates in high-dimensional settings, and the methods are shown to have excellent performance in simulations and several real data applications. Journal: Journal of the American Statistical Association Pages: 1562-1576 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.983233 File-URL: http://hdl.handle.net/10.1080/01621459.2014.983233 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1562-1576 Template-Type: ReDIF-Article 1.0 Author-Name: Jiwei Zhao Author-X-Name-First: Jiwei Author-X-Name-Last: Zhao Author-Name: Jun Shao Author-X-Name-First: Jun Author-X-Name-Last: Shao Title: Semiparametric Pseudo-Likelihoods in Generalized Linear Models With Nonignorable Missing Data Abstract: We consider identifiability and estimation in a generalized linear model in which the response variable and some covariates have missing values and the missing data mechanism is nonignorable and unspecified. We adopt a pseudo-likelihood approach that makes use of an instrumental variable to help identify unknown parameters in the presence of nonignorable missing data. Explicit conditions for the identifiability of parameters are given. Some asymptotic properties of the parameter estimators based on maximizing the pseudo-likelihood are established. An explicit asymptotic covariance matrix and its estimator are also derived in some cases. For the numerical maximization of the pseudo-likelihood, we develop a two-step iteration algorithm that decomposes a nonconcave maximization problem into two problems of maximizing concave functions. Some simulation results and an application to a dataset from cotton factory workers are also presented. Journal: Journal of the American Statistical Association Pages: 1577-1590 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.983234 File-URL: http://hdl.handle.net/10.1080/01621459.2014.983234 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1577-1590 Template-Type: ReDIF-Article 1.0 Author-Name: Laurent E. Calvet Author-X-Name-First: Laurent E. Author-X-Name-Last: Calvet Author-Name: Veronika Czellar Author-X-Name-First: Veronika Author-X-Name-Last: Czellar Author-Name: Elvezio Ronchetti Author-X-Name-First: Elvezio Author-X-Name-Last: Ronchetti Title: Robust Filtering Abstract: Filtering methods are powerful tools to estimate the hidden state of a state-space model from observations available in real time. However, they are known to be highly sensitive to the presence of small misspecifications of the underlying model and to outliers in the observation process. 
In this article, we show that the methodology of robust statistics can be adapted to sequential filtering. We define a filter as being robust if the relative error in the state distribution caused by misspecifications is uniformly bounded by a linear function of the perturbation size. Since standard filters are nonrobust even in the simplest cases, we propose robustified filters that provide accurate state inference in the presence of model misspecifications. The robust particle filter naturally mitigates the degeneracy problems that plague the bootstrap particle filter (Gordon, Salmond, and Smith) and its many extensions. We illustrate the good properties of robust filters in linear and nonlinear state-space examples. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1591-1606 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.983520 File-URL: http://hdl.handle.net/10.1080/01621459.2014.983520 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1591-1606 Template-Type: ReDIF-Article 1.0 Author-Name: Qifan Song Author-X-Name-First: Qifan Author-X-Name-Last: Song Author-Name: Faming Liang Author-X-Name-First: Faming Author-X-Name-Last: Liang Title: High-Dimensional Variable Selection With Reciprocal L1-Regularization Abstract: During the past decade, penalized likelihood methods have been widely used in variable selection problems, where the penalty functions are typically symmetric about 0, continuous, and nondecreasing in (0, ∞). We propose a new penalized likelihood method, reciprocal Lasso (or in short, rLasso), based on a new class of penalty functions that are decreasing in (0, ∞), discontinuous at 0, and converge to infinity when the coefficients approach zero. The new penalty functions assign infinite penalties to nearly zero coefficients; in contrast, the conventional penalty functions assign nearly zero coefficients either nearly zero penalties (e.g., Lasso and smoothly clipped absolute deviation [SCAD]) or constant penalties (e.g., the L0 penalty). This distinguishing feature makes rLasso very attractive for variable selection: it effectively avoids selecting overly dense models. We establish the consistency of the rLasso for variable selection and coefficient estimation under both the low- and high-dimensional settings. Since the rLasso penalty functions induce an objective function with multiple local minima, we also propose an efficient Monte Carlo optimization algorithm to solve the involved minimization problem. Our simulation results show that the rLasso outperforms other popular penalized likelihood methods, such as Lasso, SCAD, minimax concave penalty, sure independence screening, iterative sure independence screening, and extended Bayesian information criterion. It can produce sparser and more accurate coefficient estimates, and capture the true model with higher probability. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1607-1620 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.984812 File-URL: http://hdl.handle.net/10.1080/01621459.2014.984812 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1607-1620 Template-Type: ReDIF-Article 1.0 Author-Name: Ryan Martin Author-X-Name-First: Ryan Author-X-Name-Last: Martin Author-Name: Chuanhai Liu Author-X-Name-First: Chuanhai Author-X-Name-Last: Liu Title: Marginal Inferential Models: Prior-Free Probabilistic Inference on Interest Parameters Abstract: The inferential models (IM) framework provides prior-free, frequency-calibrated, and posterior probabilistic inference. The key is the use of random sets to predict unobservable auxiliary variables connected to the observable data and unknown parameters. When nuisance parameters are present, a marginalization step can reduce the dimension of the auxiliary variable which, in turn, leads to more efficient inference. For regular problems, exact marginalization can be achieved, and we give conditions for marginal IM validity. We show that our approach provides exact and efficient marginal inference in several challenging problems, including a many-normal-means problem. In nonregular problems, we propose a generalized marginalization technique and prove its validity. Details are given for two benchmark examples, namely, the Behrens--Fisher and gamma mean problems. Journal: Journal of the American Statistical Association Pages: 1621-1631 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.985827 File-URL: http://hdl.handle.net/10.1080/01621459.2014.985827 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1621-1631 Template-Type: ReDIF-Article 1.0 Author-Name: Hiroyuki Kasahara Author-X-Name-First: Hiroyuki Author-X-Name-Last: Kasahara Author-Name: Katsumi Shimotsu Author-X-Name-First: Katsumi Author-X-Name-Last: Shimotsu Title: Testing the Number of Components in Normal Mixture Regression Models Abstract: Testing the number of components in finite normal mixture models is a long-standing challenge because of its nonregularity. This article studies likelihood-based testing of the number of components in normal mixture regression models with heteroscedastic components. We construct a likelihood-based test of the null hypothesis of m0 components against the alternative hypothesis of m0 + 1 components for any m0. The null asymptotic distribution of the proposed modified EM test statistic is the maximum of m0 random variables that can be easily simulated. The simulations show that the proposed test has very good finite sample size and power properties. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1632-1645 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.986272 File-URL: http://hdl.handle.net/10.1080/01621459.2014.986272 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1632-1645 Template-Type: ReDIF-Article 1.0 Author-Name: Daniel K. Sewell Author-X-Name-First: Daniel K. Author-X-Name-Last: Sewell Author-Name: Yuguo Chen Author-X-Name-First: Yuguo Author-X-Name-Last: Chen Title: Latent Space Models for Dynamic Networks Abstract: Dynamic networks are used in a variety of fields to represent the structure and evolution of the relationships between entities. We present a model which embeds longitudinal network data as trajectories in a latent Euclidean space. 
We propose a Markov chain Monte Carlo (MCMC) algorithm to estimate the model parameters and the latent positions of the actors in the network. The model yields meaningful visualization of dynamic networks, giving the researcher insight into the evolution and the structure, both local and global, of the network. The model handles directed or undirected edges, easily handles missing edges, and lends itself well to predicting future edges. Further, a novel approach is given to detect and visualize an attracting influence between actors using only the edge information. We use the case-control likelihood approximation to speed up the estimation algorithm, modifying it slightly to account for missing data. We apply the latent space model to data collected from a Dutch classroom, and a cosponsorship network collected on members of the U.S. House of Representatives, illustrating the usefulness of the model through the insights it yields into these networks. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1646-1657 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.988214 File-URL: http://hdl.handle.net/10.1080/01621459.2014.988214 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1646-1657 Template-Type: ReDIF-Article 1.0 Author-Name: Lan Wang Author-X-Name-First: Lan Author-X-Name-Last: Wang Author-Name: Bo Peng Author-X-Name-First: Bo Author-X-Name-Last: Peng Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Title: A High-Dimensional Nonparametric Multivariate Test for Mean Vector Abstract: This work is concerned with testing the population mean vector of nonnormal high-dimensional multivariate data. Several tests for the high-dimensional mean vector, based on modifying the classical Hotelling $T^2$ test, have been proposed in the literature. Despite their usefulness, they tend to have unsatisfactory power performance for heavy-tailed multivariate data, which frequently arise in genomics and quantitative finance. This article proposes a novel high-dimensional nonparametric test for the population mean vector for a general class of multivariate distributions. With the aid of new tools in modern probability theory, we prove that the limiting null distribution of the proposed test is normal under mild conditions when p is substantially larger than n. We further study the local power of the proposed test and compare its relative efficiency with a modified Hotelling $T^2$ test for high-dimensional data. An interesting finding is that the newly proposed test can achieve an even more substantial power gain with large p than the traditional nonparametric multivariate test does with finite fixed p. We study the finite sample performance of the proposed test via Monte Carlo simulations. We further illustrate its application by an empirical analysis of a genomics dataset. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1658-1669 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.988215 File-URL: http://hdl.handle.net/10.1080/01621459.2014.988215 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1658-1669 Template-Type: ReDIF-Article 1.0 Author-Name: Yuanshan Wu Author-X-Name-First: Yuanshan Author-X-Name-Last: Wu Author-Name: Yanyuan Ma Author-X-Name-First: Yanyuan Author-X-Name-Last: Ma Author-Name: Guosheng Yin Author-X-Name-First: Guosheng Author-X-Name-Last: Yin Title: Smoothed and Corrected Score Approach to Censored Quantile Regression With Measurement Errors Abstract: Censored quantile regression is an important alternative to the Cox proportional hazards model in survival analysis. In contrast to the usual central covariate effects, quantile regression can effectively characterize the covariate effects at different quantiles of the survival time. When covariates are measured with errors, it is known that naively treating mismeasured covariates as error-free would result in estimation bias. Under censored quantile regression, we propose smoothed and corrected estimating equations to obtain consistent estimators. We establish consistency and asymptotic normality for the proposed estimators of quantile regression coefficients. Compared with the naive estimator, the proposed method can eliminate the estimation bias under various measurement error distributions and model error distributions. We conduct simulation studies to examine the finite-sample properties of the new method and apply it to a lung cancer study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1670-1683 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.989323 File-URL: http://hdl.handle.net/10.1080/01621459.2014.989323 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1670-1683 Template-Type: ReDIF-Article 1.0 Author-Name: Tyler H. McCormick Author-X-Name-First: Tyler H. Author-X-Name-Last: McCormick Author-Name: Tian Zheng Author-X-Name-First: Tian Author-X-Name-Last: Zheng Title: Latent Surface Models for Networks Using Aggregated Relational Data Abstract: Despite increased interest across a range of scientific applications in modeling and understanding social network structure, collecting complete network data remains logistically and financially challenging, especially in the social sciences. This article introduces a latent surface representation of social network structure for partially observed network data. We derive a multivariate measure of expected (latent) distance between an observed actor and unobserved actors with given features. We also draw novel parallels between our work and dependent data in spatial and ecological statistics. We demonstrate the contribution of our model using a random digit-dial telephone survey and a multiyear prospective study of the relationship between network structure and the spread of infectious disease. The model proposed here is related to previous network models that represent high-dimensional structure through a projection to a low-dimensional latent geometric surface, encoding dependence as distance in the space. We develop a latent surface model for cases when complete network data are unavailable. We focus specifically on aggregated relational data (ARD), which measure network structure indirectly by asking respondents how many connections they have with members of a certain subpopulation (e.g., How many individuals do you know who are HIV positive?) and which are easily added to existing surveys. 
Instead of conditioning on the (latent) distance between two members of the network, the latent surface model for ARD conditions on the expected distance between a survey respondent and the center of a subpopulation on a latent manifold surface. A spherical latent surface and angular distance across the sphere’s surface facilitate tractable computation of this expectation. This model estimates relative homogeneity between groups in the population and variation in the propensity for interaction between respondents and group members. The model also estimates features of groups that are difficult to reach using standard surveys (e.g., the homeless). Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1684-1695 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.991395 File-URL: http://hdl.handle.net/10.1080/01621459.2014.991395 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1684-1695 Template-Type: ReDIF-Article 1.0 Author-Name: Jin Xu Author-X-Name-First: Jin Author-X-Name-Last: Xu Author-Name: Jiajie Chen Author-X-Name-First: Jiajie Author-X-Name-Last: Chen Author-Name: Peter Z. G. Qian Author-X-Name-First: Peter Z. G. Author-X-Name-Last: Qian Title: Sequentially Refined Latin Hypercube Designs: Reusing Every Point Abstract: The use of iteratively enlarged Latin hypercube designs for running computer experiments has recently gained popularity in practice. This approach conducts an initial experiment with a computer code using a Latin hypercube design and then runs a follow-up experiment with additional runs elaborately chosen so that the combined design set for the two experiments forms a larger Latin hypercube design. This augmenting process can be repeated over multiple stages, where at each stage the augmented design set is guaranteed to be a Latin hypercube design. We provide a theoretical framework to put this approach on a firm footing. Numerical examples are given to corroborate the derived theoretical results. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1696-1706 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.993078 File-URL: http://hdl.handle.net/10.1080/01621459.2014.993078 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1696-1706 Template-Type: ReDIF-Article 1.0 Author-Name: Noah Simon Author-X-Name-First: Noah Author-X-Name-Last: Simon Author-Name: Robert Tibshirani Author-X-Name-First: Robert Author-X-Name-Last: Tibshirani Title: A Permutation Approach to Testing Interactions for Binary Response by Comparing Correlations Between Classes Abstract: To date, testing interactions in high dimensions remains a challenging task. Existing methods often suffer from sensitivity to modeling assumptions and rely on heavily asymptotic nominal p-values. To help alleviate these issues, we propose a permutation-based method for testing marginal interactions with a binary response. Our method searches for pairwise correlations that differ between classes. In this article, we compare our method on real and simulated data to the standard approach of running many pairwise logistic models. On simulated data our method finds more significant interactions at a lower false discovery rate (especially in the presence of main effects). 
On real genomic data, although there is no gold standard, our method finds apparent signal and tells a believable story, while logistic regression does not. We also give asymptotic consistency results under not-too-restrictive assumptions. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1707-1716 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.993079 File-URL: http://hdl.handle.net/10.1080/01621459.2014.993079 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1707-1716 Template-Type: ReDIF-Article 1.0 Author-Name: Wenxin Jiang Author-X-Name-First: Wenxin Author-X-Name-Last: Jiang Author-Name: Yu Zhao Author-X-Name-First: Yu Author-X-Name-Last: Zhao Title: On Asymptotic Distributions and Confidence Intervals for LIFT Measures in Data Mining Abstract: A LIFT measure, such as the response rate, lift, or the percentage of captured response, is a fundamental measure of effectiveness for a scoring rule obtained from data mining, and it is estimated from a set of validation data. In this article, we study how to construct confidence intervals for the LIFT measures. We point out the subtlety of this task and explain how simple binomial confidence intervals can have incorrect coverage probabilities, due to omitting the variation arising from the sample percentile of the scoring rule. In the Appendix, we derive the asymptotic distribution using advanced empirical process theory and the functional delta method. The additional variation is shown to be related to a conditional mean response, which can be estimated by a local averaging of the responses over the scores from the validation data. Alternatively, a subsampling method is shown to provide a valid confidence interval, without needing to estimate the conditional mean response. Numerical experiments are conducted to compare these different methods regarding the coverage probabilities and the lengths of the resulting confidence intervals. Journal: Journal of the American Statistical Association Pages: 1717-1725 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.993080 File-URL: http://hdl.handle.net/10.1080/01621459.2014.993080 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1717-1725 Template-Type: ReDIF-Article 1.0 Author-Name: Xueqin Wang Author-X-Name-First: Xueqin Author-X-Name-Last: Wang Author-Name: Wenliang Pan Author-X-Name-First: Wenliang Author-X-Name-Last: Pan Author-Name: Wenhao Hu Author-X-Name-First: Wenhao Author-X-Name-Last: Hu Author-Name: Yuan Tian Author-X-Name-First: Yuan Author-X-Name-Last: Tian Author-Name: Heping Zhang Author-X-Name-First: Heping Author-X-Name-Last: Zhang Title: Conditional Distance Correlation Abstract: Statistical inference on conditional dependence is essential in many fields including genetic association studies and graphical models. The classic measures focus on linear conditional correlations and are incapable of characterizing nonlinear conditional relationships, including nonmonotonic relationships. To overcome this limitation, we introduce a nonparametric measure of conditional dependence for multivariate random variables with arbitrary dimensions. Our measure possesses the necessary and intuitive properties of a correlation index. 
Briefly, it is zero almost surely if and only if two multivariate random variables are conditionally independent given a third random variable. More importantly, the sample version of this measure can be expressed elegantly as the root of a V- or U-process with random kernels and has desirable theoretical properties. Based on the sample version, we propose a test for conditional independence, which our numerical simulations show to be more powerful than some recently developed tests. The advantage of our test is even greater when the relationship between the multivariate random variables given the third random variable cannot be expressed in a linear or monotonic function of one random variable versus the other. We also show that the sample measure is consistent and weakly convergent, and the test statistic is asymptotically normal. By applying our test in a real data analysis, we are able to identify two conditionally associated gene expressions, which otherwise could not be revealed. Thus, our measure of conditional dependence is not only an appealing concept but also one of important practical utility. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1726-1734 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2014.993081 File-URL: http://hdl.handle.net/10.1080/01621459.2014.993081 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1726-1734 Template-Type: ReDIF-Article 1.0 Author-Name: Gauri Sankar Datta Author-X-Name-First: Gauri Sankar Author-X-Name-Last: Datta Author-Name: Abhyuday Mandal Author-X-Name-First: Abhyuday Author-X-Name-Last: Mandal Title: Small Area Estimation With Uncertain Random Effects Abstract: Random effects models play an important role in model-based small area estimation. Random effects account for any lack of fit of a regression model for the population means of small areas on a set of explanatory variables. In a recent article, Datta, Hall, and Mandal showed that if the random effects can be dispensed with via a suitable test, then the model parameters and the small area means may be estimated with substantially higher accuracy. The work of Datta, Hall, and Mandal is most useful when the number of small areas, m, is moderately large. For large m, the null hypothesis of no random effects will likely be rejected. Rejection of the null hypothesis is usually caused by a few large residuals signifying a departure of the direct estimator from the synthetic regression estimator. As a flexible alternative to the Fay--Herriot random effects model and the approach in Datta, Hall, and Mandal, in this article we consider a mixture model for random effects. It is reasonably expected that small areas with population means explained adequately by covariates have little model error, while the other areas with means not adequately explained by covariates will require a random component added to the regression model. This model is a useful alternative to the usual random effects model: the data determine the extent of lack of fit of the regression model for a particular small area and include a random effect if needed. Unlike the Datta, Hall, and Mandal approach, which recommends excluding random effects from all small areas if a test of the null hypothesis of no random effects is not rejected, the present model is more flexible. 
We used this mixture model to estimate poverty ratios for related children aged 5--17 for the 50 U.S. states and Washington, DC. This application is motivated by the SAIPE project of the U.S. Census Bureau. We empirically evaluated the accuracy of the direct estimates and the estimates obtained from our mixture model and the Fay--Herriot random effects model. These empirical evaluations and a simulation study, in conjunction with a lower posterior variance of the new estimates, show that the new estimates are more accurate than both the frequentist and the Bayes estimates resulting from the standard Fay--Herriot model. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1735-1744 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1016526 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1016526 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1735-1744 Template-Type: ReDIF-Article 1.0 Author-Name: Brigham R. Frandsen Author-X-Name-First: Brigham R. Author-X-Name-Last: Frandsen Title: Treatment Effects With Censoring and Endogeneity Abstract: This article develops a nonparametric approach to identification and estimation of treatment effects on censored outcomes when treatment may be endogenous and have arbitrarily heterogeneous effects. Identification is based on an instrumental variable that satisfies the exclusion and monotonicity conditions standard in the local average treatment effects framework. The article proposes a censored quantile treatment effects estimator, derives its asymptotic distribution, and illustrates its performance using Monte Carlo simulations. Even in the exogenous case, the estimator performs better in finite samples than existing censored quantile regression estimators, and performs nearly as well as maximum likelihood estimators in cases where their distributional assumptions hold. An empirical application to a subsidized job training program finds that participation significantly and dramatically reduced the duration of jobless spells, especially at the right tail of the distribution. Journal: Journal of the American Statistical Association Pages: 1745-1752 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1017577 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1017577 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1745-1752 Template-Type: ReDIF-Article 1.0 Author-Name: Sebastian Calonico Author-X-Name-First: Sebastian Author-X-Name-Last: Calonico Author-Name: Matias D. Cattaneo Author-X-Name-First: Matias D. Author-X-Name-Last: Cattaneo Author-Name: Rocío Titiunik Author-X-Name-First: Rocío Author-X-Name-Last: Titiunik Title: Optimal Data-Driven Regression Discontinuity Plots Abstract: Exploratory data analysis plays a central role in applied statistics and econometrics. In the popular regression-discontinuity (RD) design, the use of graphical analysis has been strongly advocated because it provides both easy presentation and transparent validation of the design. RD plots are nowadays widely used in applications, despite their formal properties being unknown: these plots are typically presented employing ad hoc choices of tuning parameters, which makes these procedures less automatic and more subjective. 
In this article, we formally study the most common RD plot, based on an evenly spaced binning of the data, and propose several (optimal) data-driven choices for the number of bins depending on the goal of the researcher. These RD plots are constructed either to approximate the underlying unknown regression functions without imposing smoothness in the estimator, or to approximate the underlying variability of the raw data while smoothing out the otherwise uninformative scatterplot of the data. In addition, we introduce an alternative RD plot based on quantile-spaced binning, study its formal properties, and propose similar (optimal) data-driven choices for the number of bins. The main proposed data-driven selectors employ spacings estimators, which are simple and easy to implement in applications because they do not require additional choices of tuning parameters. Altogether, our results offer an array of alternative RD plots that are objective and automatic when implemented, providing a reliable benchmark for graphical analysis in RD designs. We illustrate the performance of our automatic RD plots using several empirical examples and a Monte Carlo study. All results are readily available in R and Stata using the software packages described in Calonico, Cattaneo, and Titiunik. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1753-1769 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1017578 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1017578 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1753-1769 Template-Type: ReDIF-Article 1.0 Author-Name: Ruoqing Zhu Author-X-Name-First: Ruoqing Author-X-Name-Last: Zhu Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: Michael R. Kosorok Author-X-Name-First: Michael R. Author-X-Name-Last: Kosorok Title: Reinforcement Learning Trees Abstract: In this article, we introduce a new type of tree-based method, reinforcement learning trees (RLT), which exhibits significantly improved performance over traditional methods such as random forests (Breiman 2001) under high-dimensional settings. The innovations are threefold. First, the new method implements reinforcement learning at each selection of a splitting variable during the tree construction process. By splitting on the variable that brings the greatest future improvement in later splits, rather than choosing the one with the largest marginal effect from the immediate split, the constructed tree uses the available samples in a more efficient way. Moreover, such an approach enables linear combination cuts at little extra computational cost. Second, we propose a variable muting procedure that progressively eliminates noise variables during the construction of each individual tree. The muting procedure also takes advantage of reinforcement learning and prevents noise variables from being considered in the search for splitting rules, so that toward terminal nodes, where the sample size is small, the splitting rules are still constructed from only strong variables. Last, we investigate asymptotic properties of the proposed method under basic assumptions and discuss the rationale in general settings. Supplementary materials for this article are available online. 
Journal: Journal of the American Statistical Association Pages: 1770-1784 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1036994 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1036994 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1770-1784 Template-Type: ReDIF-Article 1.0 Author-Name: Aurore Delaigle Author-X-Name-First: Aurore Author-X-Name-Last: Delaigle Author-Name: Wen-Xin Zhou Author-X-Name-First: Wen-Xin Author-X-Name-Last: Zhou Title: Nonparametric and Parametric Estimators of Prevalence From Group Testing Data With Aggregated Covariates Abstract: Group testing is a technique employed in large screening studies involving infectious disease, where individuals in the study are grouped before being observed. Parametric and nonparametric estimators of conditional prevalence have been developed in the group testing literature, in the case where the binary variable indicating the disease status is available only for the group, but the explanatory variable is observed for each individual. However, for reasons such as the high cost of assays, the confidentiality of the patients, or the impossibility of measuring a concentration under a detection limit, the explanatory variable is observable only in an aggregated form and the existing techniques are no longer valid. We develop consistent parametric and nonparametric estimators of the conditional prevalence in this complex problem. We establish theoretical properties of our estimators and illustrate their practical performance on simulated and real data. We extend our techniques to the case where the group status is measured imperfectly, and to the setting where the covariate is aggregated and the individual status is available. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1785-1796 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1054491 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1054491 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1785-1796 Template-Type: ReDIF-Article 1.0 Author-Name: Xiaofeng Shao Author-X-Name-First: Xiaofeng Author-X-Name-Last: Shao Title: Self-Normalization for Time Series: A Review of Recent Developments Abstract: This article reviews some recent developments on the inference of time series data using the self-normalized approach. We aim to provide a detailed discussion about the use of self-normalization in different contexts and highlight the distinctive features associated with each problem and the connections among these recent developments. The topics covered include: confidence interval construction for a parameter in a weakly dependent stationary time series setting, change point detection in the mean, robust inference in regression models with weakly dependent errors, inference for nonparametric time series regression, inference for long memory time series, locally stationary time series and near-integrated time series, change point detection, and two-sample inference for functional time series, as well as the use of self-normalization for spatial data and spatial-temporal data. Some new variations of the self-normalized approach are also introduced with additional simulation results.
We also provide a brief review of related inferential methods, such as blockwise empirical likelihood and subsampling, which were recently developed under the fixed-b asymptotic framework. We conclude the article with a summary of merits and limitations of self-normalization in the time series context and potential topics for future investigation. Journal: Journal of the American Statistical Association Pages: 1797-1817 Issue: 512 Volume: 110 Year: 2015 Month: 12 X-DOI: 10.1080/01621459.2015.1050493 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1050493 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1797-1817 Template-Type: ReDIF-Article 1.0 Author-Name: Ci-Ren Jiang Author-X-Name-First: Ci-Ren Author-X-Name-Last: Jiang Author-Name: John A. D. Aston Author-X-Name-First: John A. D. Author-X-Name-Last: Aston Author-Name: Jane-Ling Wang Author-X-Name-First: Jane-Ling Author-X-Name-Last: Wang Title: A Functional Approach to Deconvolve Dynamic Neuroimaging Data Abstract: Positron emission tomography (PET) is an imaging technique which can be used to investigate chemical changes in human biological processes such as cancer development or neurochemical reactions. Most dynamic PET scans are currently analyzed based on the assumption that linear first-order kinetics can be used to adequately describe the system under observation. However, there has recently been strong evidence that this is not the case. To provide an analysis of PET data which is free from this compartmental assumption, we propose a nonparametric deconvolution and analysis model for dynamic PET data based on functional principal component analysis. This yields flexibility in the possible deconvolved functions while still performing well when a linear compartmental model setup is the true data generating mechanism. As the deconvolution needs to be performed on only a relatively small number of basis functions rather than voxel by voxel in the entire three-dimensional volume, the methodology is robust to typical brain imaging noise levels while also being computationally efficient. The new methodology is investigated through simulations on both one-dimensional functions and two-dimensional images and is also applied to a neuroimaging study whose goal is the quantification of opioid receptor concentration in the brain. Journal: Journal of the American Statistical Association Pages: 1-13 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2015.1060241 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1060241 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:1-13 Template-Type: ReDIF-Article 1.0 Author-Name: P. Richard Hahn Author-X-Name-First: P. Richard Author-X-Name-Last: Hahn Author-Name: Jared S. Murray Author-X-Name-First: Jared S. Author-X-Name-Last: Murray Author-Name: Ioanna Manolopoulou Author-X-Name-First: Ioanna Author-X-Name-Last: Manolopoulou Title: A Bayesian Partial Identification Approach to Inferring the Prevalence of Accounting Misconduct Abstract: This article describes the use of flexible Bayesian regression models for estimating a partially identified probability function. Our approach permits efficient sensitivity analysis concerning the posterior impact of priors on the partially identified component of the regression model.
The new methodology is illustrated on an important problem where only partially observed data are available—inferring the prevalence of accounting misconduct among publicly traded U.S. businesses. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 14-26 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2015.1084307 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1084307 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:14-26 Template-Type: ReDIF-Article 1.0 Author-Name: Zhiguang Huo Author-X-Name-First: Zhiguang Author-X-Name-Last: Huo Author-Name: Ying Ding Author-X-Name-First: Ying Author-X-Name-Last: Ding Author-Name: Silvia Liu Author-X-Name-First: Silvia Author-X-Name-Last: Liu Author-Name: Steffi Oesterreich Author-X-Name-First: Steffi Author-X-Name-Last: Oesterreich Author-Name: George Tseng Author-X-Name-First: George Author-X-Name-Last: Tseng Title: Meta-Analytic Framework for Sparse K-Means to Identify Disease Subtypes in Multiple Transcriptomic Studies Abstract: Disease phenotyping by omics data has become a popular approach that can potentially lead to better personalized treatment. Identifying disease subtypes via unsupervised machine learning is the first step toward this goal. In this article, we extend a sparse K-means method toward a meta-analytic framework to identify novel disease subtypes when expression profiles of multiple cohorts are available. The lasso regularization and meta-analysis identify a unique set of gene features for subtype characterization. An additional pattern matching reward function guarantees consistent subtype signatures across studies. The method was evaluated by simulations and leukemia and breast cancer datasets. The identified disease subtypes from meta-analysis were characterized with improved accuracy and stability compared to single study analysis. The breast cancer model was applied to an independent METABRIC dataset and generated an improved survival difference between subtypes. These results provide a basis for diagnosis and development of targeted treatments for disease subgroups. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 27-42 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2015.1086354 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1086354 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:27-42 Template-Type: ReDIF-Article 1.0 Author-Name: Mehdi Maadooliat Author-X-Name-First: Mehdi Author-X-Name-Last: Maadooliat Author-Name: Lan Zhou Author-X-Name-First: Lan Author-X-Name-Last: Zhou Author-Name: Seyed Morteza Najibi Author-X-Name-First: Seyed Morteza Author-X-Name-Last: Najibi Author-Name: Xin Gao Author-X-Name-First: Xin Author-X-Name-Last: Gao Author-Name: Jianhua Z. Huang Author-X-Name-First: Jianhua Z. Author-X-Name-Last: Huang Title: Collective Estimation of Multiple Bivariate Density Functions With Application to Angular-Sampling-Based Protein Loop Modeling Abstract: This article develops a method for simultaneous estimation of density functions for a collection of populations of protein backbone angle pairs using a data-driven, shared basis that is constructed by bivariate spline functions defined on a triangulation of the bivariate domain.
The circular nature of angular data is taken into account by imposing appropriate smoothness constraints across boundaries of the triangles. Maximum penalized likelihood is used to fit the model and an alternating blockwise Newton-type algorithm is developed for computation. A simulation study shows that the collective estimation approach is statistically more efficient than estimating the densities individually. The proposed method was used to estimate neighbor-dependent distributions of protein backbone dihedral angles (i.e., Ramachandran distributions). The estimated distributions were applied to protein loop modeling, one of the most challenging open problems in protein structure prediction, by feeding them into an angular-sampling-based loop structure prediction framework. Our estimated distributions compared favorably to the Ramachandran distributions estimated by fitting a hierarchical Dirichlet process model; and in particular, our distributions showed significant improvements on the hard cases where existing methods do not work well. Journal: Journal of the American Statistical Association Pages: 43-56 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2015.1099535 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1099535 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:43-56 Template-Type: ReDIF-Article 1.0 Author-Name: Won Chang Author-X-Name-First: Won Author-X-Name-Last: Chang Author-Name: Murali Haran Author-X-Name-First: Murali Author-X-Name-Last: Haran Author-Name: Patrick Applegate Author-X-Name-First: Patrick Author-X-Name-Last: Applegate Author-Name: David Pollard Author-X-Name-First: David Author-X-Name-Last: Pollard Title: Calibrating an Ice Sheet Model Using High-Dimensional Binary Spatial Data Abstract: Rapid retreat of ice in the Amundsen Sea sector of West Antarctica may cause drastic sea level rise, posing significant risks to populations in low-lying coastal regions. Calibration of computer models representing the behavior of the West Antarctic Ice Sheet is key for informative projections of future sea level rise. However, both the relevant observations and the model output are high-dimensional binary spatial data; existing computer model calibration methods are unable to handle such data. Here we present a novel calibration method for computer models whose output is in the form of binary spatial data. To mitigate the computational and inferential challenges posed by our approach, we apply a generalized principal component based dimension reduction method. To demonstrate the utility of our method, we calibrate the PSU3D-ICE model by comparing the output from a 499-member perturbed-parameter ensemble with observations from the Amundsen Sea sector of the ice sheet. Our methods help rigorously characterize the parameter uncertainty even in the presence of systematic data-model discrepancies and dependence in the errors. Our method also helps inform environmental risk analyses by contributing to improved projections of sea level rise from the ice sheets. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 57-72 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2015.1108199 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1108199 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:57-72 Template-Type: ReDIF-Article 1.0 Author-Name: Lisa M. Pham Author-X-Name-First: Lisa M. Author-X-Name-Last: Pham Author-Name: Luis Carvalho Author-X-Name-First: Luis Author-X-Name-Last: Carvalho Author-Name: Scott Schaus Author-X-Name-First: Scott Author-X-Name-Last: Schaus Author-Name: Eric D. Kolaczyk Author-X-Name-First: Eric D. Author-X-Name-Last: Kolaczyk Title: Perturbation Detection Through Modeling of Gene Expression on a Latent Biological Pathway Network: A Bayesian Hierarchical Approach Abstract: Cellular response to a perturbation is the result of a dynamic system of biological variables linked in a complex network. A major challenge in drug and disease studies is identifying the key factors of a biological network that are essential in determining the cell’s fate. Here, our goal is the identification of perturbed pathways from high-throughput gene expression data. We develop a three-level hierarchical model, where (i) the first level captures the relationship between gene expression and biological pathways using confirmatory factor analysis, (ii) the second level models the behavior within an underlying network of pathways induced by an unknown perturbation using a conditional autoregressive model, and (iii) the third level is a spike-and-slab prior on the perturbations. We then identify perturbations through posterior-based variable selection. We illustrate our approach using gene transcription drug perturbation profiles from the DREAM7 drug sensitivity prediction challenge dataset. Our proposed method identified regulatory pathways that are known to play a causative role and that were not readily resolved using gene set enrichment analysis or exploratory factor models. Simulation results are presented assessing the performance of this model relative to a network-free variant and its robustness to inaccuracies in biological databases. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 73-92 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2015.1110523 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1110523 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:73-92 Template-Type: ReDIF-Article 1.0 Author-Name: Matthew R. Schofield Author-X-Name-First: Matthew R. Author-X-Name-Last: Schofield Author-Name: Richard J. Barker Author-X-Name-First: Richard J. Author-X-Name-Last: Barker Author-Name: Andrew Gelman Author-X-Name-First: Andrew Author-X-Name-Last: Gelman Author-Name: Edward R. Cook Author-X-Name-First: Edward R. Author-X-Name-Last: Cook Author-Name: Keith R. Briffa Author-X-Name-First: Keith R. Author-X-Name-Last: Briffa Title: A Model-Based Approach to Climate Reconstruction Using Tree-Ring Data Abstract: Quantifying long-term historical climate is fundamental to understanding recent climate change. Most instrumentally recorded climate data are only available for the past 200 years, so proxy observations from natural archives are often considered. We describe a model-based approach to reconstructing climate defined in terms of raw tree-ring measurement data that simultaneously accounts for nonclimatic and climatic variability. In this approach, we specify a joint model for the tree-ring data and climate variable that we fit using Bayesian inference.
We consider a range of prior densities and compare the modeling approach to current methodology using an example case of Scots pine from Torneträsk, Sweden, to reconstruct growing season temperature. We describe how current approaches translate into particular model assumptions. We explore how changes to various components in the model-based approach affect the resulting reconstruction. We show that minor changes in model specification can have little effect on model fit but lead to large changes in the predictions. In particular, the periods of relatively warmer and cooler temperatures are robust between models, but the magnitude of the resulting temperatures is highly model dependent. Such sensitivity may not be apparent with traditional approaches because the underlying statistical model is often hidden or poorly described. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 93-106 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2015.1110524 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1110524 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:93-106 Template-Type: ReDIF-Article 1.0 Author-Name: Nilanjan Chatterjee Author-X-Name-First: Nilanjan Author-X-Name-Last: Chatterjee Author-Name: Yi-Hau Chen Author-X-Name-First: Yi-Hau Author-X-Name-Last: Chen Author-Name: Paige Maas Author-X-Name-First: Paige Author-X-Name-Last: Maas Author-Name: Raymond J. Carroll Author-X-Name-First: Raymond J. Author-X-Name-Last: Carroll Title: Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-Level Information From External Big Data Sources Abstract: Information from various public and private data sources of extremely large sample sizes is now increasingly available for research purposes. Statistical methods are needed for using information from such big data sources while analyzing data from individual studies that may collect more detailed information required for addressing specific hypotheses of interest. In this article, we consider the problem of building regression models based on individual-level data from an “internal” study while using summary-level information, such as information on parameters for reduced models, from an “external” big data source. We identify a set of very general constraints that link internal and external models. These constraints are used to develop a framework for semiparametric maximum likelihood inference that allows the distribution of covariates to be estimated using either the internal sample or an external reference sample. We develop extensions for handling complex stratified sampling designs, such as case-control sampling, for the internal study. Asymptotic theory and variance estimators are developed for each case. We use simulation studies and a real data application to assess the performance of the proposed methods in contrast to the generalized regression calibration methodology that is popular in the sample survey literature. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 107-117 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2015.1123157 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1123157 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:107-117 Template-Type: ReDIF-Article 1.0 Author-Name: Peisong Han Author-X-Name-First: Peisong Author-X-Name-Last: Han Author-Name: Jerald F. Lawless Author-X-Name-First: Jerald F. Author-X-Name-Last: Lawless Title: Comment Journal: Journal of the American Statistical Association Pages: 118-121 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2016.1149399 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1149399 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:118-121 Template-Type: ReDIF-Article 1.0 Author-Name: Sebastien Haneuse Author-X-Name-First: Sebastien Author-X-Name-Last: Haneuse Author-Name: Claudia Rivera Author-X-Name-First: Claudia Author-X-Name-Last: Rivera Title: Comment Journal: Journal of the American Statistical Association Pages: 121-122 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2016.1149401 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1149401 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:121-122 Template-Type: ReDIF-Article 1.0 Author-Name: Thomas A. Louis Author-X-Name-First: Thomas A. Author-X-Name-Last: Louis Author-Name: Niels Keiding Author-X-Name-First: Niels Author-X-Name-Last: Keiding Title: Comment Journal: Journal of the American Statistical Association Pages: 123-124 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2016.1149403 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1149403 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:123-124 Template-Type: ReDIF-Article 1.0 Author-Name: Joel A. Mefford Author-X-Name-First: Joel A. Author-X-Name-Last: Mefford Author-Name: Noah A. Zaitlen Author-X-Name-First: Noah A. Author-X-Name-Last: Zaitlen Author-Name: John S. Witte Author-X-Name-First: John S. Author-X-Name-Last: Witte Title: Comment: A Human Genetics Perspective Journal: Journal of the American Statistical Association Pages: 124-127 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2016.1149404 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1149404 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:124-127 Template-Type: ReDIF-Article 1.0 Author-Name: Chirag J. Patel Author-X-Name-First: Chirag J. Author-X-Name-Last: Patel Author-Name: Francesca Dominici Author-X-Name-First: Francesca Author-X-Name-Last: Dominici Title: Comment: Addressing the Need for Portability in Big Data Model Building and Calibration Journal: Journal of the American Statistical Association Pages: 127-129 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2016.1149406 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1149406 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:127-129 Template-Type: ReDIF-Article 1.0 Author-Name: Nilanjan Chatterjee Author-X-Name-First: Nilanjan Author-X-Name-Last: Chatterjee Author-Name: Yi-Hau Chen Author-X-Name-First: Yi-Hau Author-X-Name-Last: Chen Author-Name: Paige Maas Author-X-Name-First: Paige Author-X-Name-Last: Maas Author-Name: Raymond J. Carroll Author-X-Name-First: Raymond J. 
Author-X-Name-Last: Carroll Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 130-131 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2016.1149407 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1149407 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:130-131 Template-Type: ReDIF-Article 1.0 Author-Name: Hyunseung Kang Author-X-Name-First: Hyunseung Author-X-Name-Last: Kang Author-Name: Anru Zhang Author-X-Name-First: Anru Author-X-Name-Last: Zhang Author-Name: T. Tony Cai Author-X-Name-First: T. Tony Author-X-Name-Last: Cai Author-Name: Dylan S. Small Author-X-Name-First: Dylan S. Author-X-Name-Last: Small Title: Instrumental Variables Estimation With Some Invalid Instruments and its Application to Mendelian Randomization Abstract: Instrumental variables have been widely used for estimating the causal effect between exposure and outcome. Conventional estimation methods require complete knowledge about all the instruments’ validity; a valid instrument must not have a direct effect on the outcome and not be related to unmeasured confounders. Often, this is impractical, as highlighted by Mendelian randomization studies where genetic markers are used as instruments and complete knowledge about instruments’ validity is equivalent to complete knowledge about the involved genes’ functions. In this article, we propose a method for estimation of causal effects when this complete knowledge is absent. It is shown that causal effects are identified and can be estimated as long as less than 50% of instruments are invalid, without knowing which of the instruments are invalid. We also introduce conditions for identification when the 50% threshold is violated. A fast penalized ℓ1 estimation method, called sisVIVE, is introduced for estimating the causal effect without knowing which instruments are valid, with theoretical guarantees on its performance. The proposed method is demonstrated on simulated data and a real Mendelian randomization study concerning the effect of body mass index (BMI) on health-related quality of life (HRQL) index. An R package sisVIVE is available on CRAN. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 132-144 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2014.994705 File-URL: http://hdl.handle.net/10.1080/01621459.2014.994705 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:132-144 Template-Type: ReDIF-Article 1.0 Author-Name: Xiaoyan Sun Author-X-Name-First: Xiaoyan Author-X-Name-Last: Sun Author-Name: Limin Peng Author-X-Name-First: Limin Author-X-Name-Last: Peng Author-Name: Yijian Huang Author-X-Name-First: Yijian Author-X-Name-Last: Huang Author-Name: HuiChuan J. Lai Author-X-Name-First: HuiChuan J. Author-X-Name-Last: Lai Title: Generalizing Quantile Regression for Counting Processes With Applications to Recurrent Events Abstract: In survival analysis, quantile regression has become a useful approach to account for covariate effects on the distribution of an event time of interest. In this article, we discuss how quantile regression can be extended to model counting processes and thus lead to a broader regression framework for survival data. We specifically investigate the proposed modeling of counting processes for recurrent events data.
We show that the new recurrent events model retains the desirable features of quantile regression such as easy interpretation and good model flexibility, while accommodating various observation schemes encountered in observational studies. We develop a general theoretical and inferential framework for the new counting process model, which unifies with an existing method for censored quantile regression. As another contribution of this work, we propose a sample-based covariance estimation procedure, which provides a useful complement to the prevailing bootstrapping approach. We demonstrate the utility of our proposals via simulation studies and an application to a dataset from the U.S. Cystic Fibrosis Foundation Patient Registry (CFFPR). Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 145-156 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2014.995795 File-URL: http://hdl.handle.net/10.1080/01621459.2014.995795 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:145-156 Template-Type: ReDIF-Article 1.0 Author-Name: Peng Ding Author-X-Name-First: Peng Author-X-Name-Last: Ding Author-Name: Tirthankar Dasgupta Author-X-Name-First: Tirthankar Author-X-Name-Last: Dasgupta Title: A Potential Tale of Two-by-Two Tables From Completely Randomized Experiments Abstract: Causal inference in completely randomized treatment-control studies with binary outcomes is discussed from Fisherian, Neymanian, and Bayesian perspectives, using the potential outcomes model. A randomization-based justification of Fisher’s exact test is provided. Arguing that the crucial assumption of constant causal effect is often unrealistic, and holds only for extreme cases, some new asymptotic and Bayesian inferential procedures are proposed. The proposed procedures exploit the intrinsic nonadditivity of unit-level causal effects, can be applied to linear and nonlinear estimands, and dominate the existing methods, as verified theoretically and also through simulation studies. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 157-168 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2014.995796 File-URL: http://hdl.handle.net/10.1080/01621459.2014.995796 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:157-168 Template-Type: ReDIF-Article 1.0 Author-Name: Rui Pan Author-X-Name-First: Rui Author-X-Name-Last: Pan Author-Name: Hansheng Wang Author-X-Name-First: Hansheng Author-X-Name-Last: Wang Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Title: Ultrahigh-Dimensional Multiclass Linear Discriminant Analysis by Pairwise Sure Independence Screening Abstract: This article is concerned with the problem of feature screening for multiclass linear discriminant analysis under the ultrahigh-dimensional setting. We allow the number of classes to be relatively large. As a result, the total number of relevant features is larger than usual. This makes the related classification problem much more challenging than the conventional one, where the number of classes is small (very often two). To solve the problem, we propose a novel pairwise sure independence screening method for linear discriminant analysis with an ultrahigh-dimensional predictor.
The proposed procedure is directly applicable to the situation with many classes. We further prove that the proposed method is screening consistent. Simulation studies are conducted to assess the finite sample performance of the new procedure. We also demonstrate the proposed methodology via an empirical analysis of a real-life example of handwritten Chinese character recognition. Journal: Journal of the American Statistical Association Pages: 169-179 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2014.998760 File-URL: http://hdl.handle.net/10.1080/01621459.2014.998760 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:169-179 Template-Type: ReDIF-Article 1.0 Author-Name: Wenjiang Fu Author-X-Name-First: Wenjiang Author-X-Name-Last: Fu Title: Constrained Estimators and Consistency of a Regression Model on a Lexis Diagram Abstract: This article considers a regression model on a Lexis diagram of an a × p table with a single response in each cell following a distribution in the exponential family. A regression model on the fixed effects of a rows, p columns, and a + p − 1 diagonals induces a singular design matrix and yields multiple estimators, leading to the parameter identifiability problem in age--period--cohort analysis in social sciences, demography, and epidemiology, where assessment of secular trend in age, period, and birth cohort of social events (e.g., violence) and diseases (e.g., cancer) is of interest. Similar problems also exist in other settings, such as in supersaturated designs. In this article, we study the finite sample properties of the multiple estimators, propose a penalized profile likelihood method to study the consistency and asymptotic bias, and demonstrate the results through simulations and data analysis. As a by-product, the identifiability problem is addressed with consistent estimation for model parameters and secular trend. We conclude that consistent estimation can be identified through estimable functions and asymptotic studies in regressions with a singular design. Our method provides a novel approach to studying asymptotics of multiple estimators with a diverging number of nuisance parameters. Journal: Journal of the American Statistical Association Pages: 180-199 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2014.998761 File-URL: http://hdl.handle.net/10.1080/01621459.2014.998761 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:180-199 Template-Type: ReDIF-Article 1.0 Author-Name: Michalis K. Titsias Author-X-Name-First: Michalis K. Author-X-Name-Last: Titsias Author-Name: Christopher C. Holmes Author-X-Name-First: Christopher C. Author-X-Name-Last: Holmes Author-Name: Christopher Yau Author-X-Name-First: Christopher Author-X-Name-Last: Yau Title: Statistical Inference in Hidden Markov Models Using k-Segment Constraints Abstract: Hidden Markov models (HMMs) are one of the most widely used statistical methods for analyzing sequence data. However, the reporting of output from HMMs has largely been restricted to the presentation of the most-probable (MAP) hidden state sequence, found via the Viterbi algorithm, or the sequence of most probable marginals using the forward--backward algorithm.
In this article, we expand the amount of information we can obtain from the posterior distribution of an HMM by introducing linear-time dynamic programming recursions that, conditional on a user-specified constraint on the number of segments, allow us to (i) find MAP sequences, (ii) compute posterior probabilities, and (iii) simulate sample paths. We collectively call these recursions k-segment algorithms and illustrate their utility using simulated and real examples. We also highlight the prospective and retrospective use of k-segment constraints for fitting HMMs or exploring existing model fits. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 200-215 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2014.998762 File-URL: http://hdl.handle.net/10.1080/01621459.2014.998762 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:200-215 Template-Type: ReDIF-Article 1.0 Author-Name: Francesco Bartolucci Author-X-Name-First: Francesco Author-X-Name-Last: Bartolucci Author-Name: Monia Lupparelli Author-X-Name-First: Monia Author-X-Name-Last: Lupparelli Title: Pairwise Likelihood Inference for Nested Hidden Markov Chain Models for Multilevel Longitudinal Data Abstract: In the context of multilevel longitudinal data, where sample units are collected in clusters, an important aspect that should be accounted for is the unobserved heterogeneity between sample units and between clusters. For this aim, we propose an approach based on nested hidden (latent) Markov chains, which are associated with every sample unit and with every cluster. The approach allows us to account for the previously mentioned forms of unobserved heterogeneity in a dynamic fashion; it also allows us to account for the correlation that may arise between the responses provided by the units belonging to the same cluster. Under the assumed model, computing the manifest distribution of these response variables is infeasible even with a few units per cluster. Therefore, we make inference on this model through a composite likelihood function based on all the possible pairs of subjects within each cluster. Properties of the composite likelihood estimator are assessed by simulation. The proposed approach is illustrated through an application to a dataset concerning a sample of Italian workers in which a binary response variable for the worker receiving an illness benefit was repeatedly observed. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 216-228 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2014.998935 File-URL: http://hdl.handle.net/10.1080/01621459.2014.998935 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:216-228 Template-Type: ReDIF-Article 1.0 Author-Name: T. Tony Cai Author-X-Name-First: T. Tony Author-X-Name-Last: Cai Author-Name: Weidong Liu Author-X-Name-First: Weidong Author-X-Name-Last: Liu Title: Large-Scale Multiple Testing of Correlations Abstract: Multiple testing of correlations arises in many applications including gene coexpression network analysis and brain connectivity analysis. In this article, we consider large-scale simultaneous testing for correlations in both the one-sample and two-sample settings.
New multiple testing procedures are proposed and a bootstrap method is introduced for estimating the proportion of the nulls falsely rejected among all the true nulls. We investigate the properties of the proposed procedures both theoretically and numerically. It is shown that the procedures asymptotically control the overall false discovery rate and false discovery proportion at the nominal level. Simulation results show that the methods perform well numerically in terms of both the size and power of the test, and they significantly outperform two alternative methods. The two-sample procedure is also illustrated by an analysis of a prostate cancer dataset for the detection of changes in coexpression patterns between gene expression levels. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 229-240 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2014.999157 File-URL: http://hdl.handle.net/10.1080/01621459.2014.999157 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:229-240 Template-Type: ReDIF-Article 1.0 Author-Name: Yunzhang Zhu Author-X-Name-First: Yunzhang Author-X-Name-Last: Zhu Author-Name: Xiaotong Shen Author-X-Name-First: Xiaotong Author-X-Name-Last: Shen Author-Name: Changqing Ye Author-X-Name-First: Changqing Author-X-Name-Last: Ye Title: Personalized Prediction and Sparsity Pursuit in Latent Factor Models Abstract: Personalized information filtering extracts the information specifically relevant to a user, predicting his/her preference over a large number of items, based on the opinions of users who think alike or on the item content. This problem is cast into the framework of regression and classification, where we integrate additional user-specific and content-specific predictors in partial latent models, for higher predictive accuracy. In particular, we factorize a user-over-item preference matrix into a product of two matrices, each representing a user’s preference and an item preference by users. Then we propose a likelihood method to seek a sparsest latent factorization, from a class of overcomplete factorizations, possibly with a high percentage of missing values. This promotes additional sparsity beyond rank reduction. Computationally, we design methods based on a “decomposition and combination” strategy, to break large-scale optimization into many small subproblems to solve in a recursive and parallel manner. On this basis, we implement the proposed methods through multi-platform shared-memory parallel programming, and through Mahout, a library for scalable machine learning and data mining, for MapReduce computation. For example, our methods are scalable to a dataset consisting of three billion observations on a single machine with sufficient memory, with good timings. Both theoretical and numerical investigations show that the proposed methods exhibit a significant improvement in accuracy over state-of-the-art scalable methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 241-252 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2014.999158 File-URL: http://hdl.handle.net/10.1080/01621459.2014.999158 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:241-252 Template-Type: ReDIF-Article 1.0 Author-Name: T. 
Tony Cai Author-X-Name-First: T. Tony Author-X-Name-Last: Cai Author-Name: Ming Yuan Author-X-Name-First: Ming Author-X-Name-Last: Yuan Title: Minimax and Adaptive Estimation of Covariance Operator for Random Variables Observed on a Lattice Graph Abstract: Covariance structure plays an important role in high-dimensional statistical inference. In a range of applications including imaging analysis and fMRI studies, random variables are observed on a lattice graph. In such a setting, it is important to account for the lattice structure when estimating the covariance operator. In this article, we consider both minimax and adaptive estimation of the covariance operator over collections of polynomially decaying and exponentially decaying parameter spaces. We first establish the minimax rates of convergence for estimating the covariance operator under the operator norm. The results show that the dimension of the lattice graph significantly affects the optimal rates of convergence, often much more so than the dimension of the random variables. We then consider adaptive estimation of the covariance operator. A fully data-driven block thresholding procedure is proposed and is shown to be adaptively rate optimal simultaneously over a wide range of polynomially decaying and exponentially decaying parameter spaces. The adaptive block thresholding procedure is easy to implement, and numerical experiments are carried out to illustrate the merit of the procedure. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 253-265 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2014.1001067 File-URL: http://hdl.handle.net/10.1080/01621459.2014.1001067 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:253-265 Template-Type: ReDIF-Article 1.0 Author-Name: Xingdong Feng Author-X-Name-First: Xingdong Author-X-Name-Last: Feng Author-Name: Liping Zhu Author-X-Name-First: Liping Author-X-Name-Last: Zhu Title: Estimation and Testing of Varying Coefficients in Quantile Regression Abstract: In this article, we establish a novel connection between the null hypothesis H0 on the coefficients and a rank-reducible form of the varying coefficient model in quantile regression. We use B-splines to approximate the varying coefficients in the rank-reducible model, and make use of the fact that the null hypothesis H0 implies a unidimensional structure of a transformed coefficient matrix for the B-spline basis functions. By evaluating the unidimensional structure, we alleviate the difficulty of testing such hypotheses commonly considered in varying coefficient quantile models. We demonstrate through numerical studies that the proposed method can be much more powerful than the rank score test which is widely used in the quantile regression literature. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 266-274 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2014.1001068 File-URL: http://hdl.handle.net/10.1080/01621459.2014.1001068 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:266-274 Template-Type: ReDIF-Article 1.0 Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Yang Feng Author-X-Name-First: Yang Author-X-Name-Last: Feng Author-Name: Jiancheng Jiang Author-X-Name-First: Jiancheng Author-X-Name-Last: Jiang Author-Name: Xin Tong Author-X-Name-First: Xin Author-X-Name-Last: Tong Title: Feature Augmentation via Nonparametrics and Selection (FANS) in High-Dimensional Classification Abstract: We propose a high-dimensional classification method that involves nonparametric feature augmentation. Knowing that marginal density ratios are the most powerful univariate classifiers, we use the ratio estimates to transform the original feature measurements. Subsequently, penalized logistic regression is invoked, taking as input the newly transformed or augmented features. This procedure trains models equipped with local complexity and global simplicity, thereby avoiding the curse of dimensionality while creating a flexible nonlinear decision boundary. The resulting method is called feature augmentation via nonparametrics and selection (FANS). We motivate FANS by generalizing the naive Bayes model, writing the log ratio of joint densities as a linear combination of those of marginal densities. It is related to generalized additive models, but has better interpretability and computability. Risk bounds are developed for FANS. In numerical analysis, FANS is compared with competing methods, so as to provide a guideline on its best application domain. Real data analysis demonstrates that FANS performs very competitively on benchmark email spam and gene expression datasets. Moreover, FANS is implemented by an extremely fast algorithm through parallel computing. Journal: Journal of the American Statistical Association Pages: 275-287 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2015.1005212 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1005212 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:275-287 Template-Type: ReDIF-Article 1.0 Author-Name: Jianhua Guo Author-X-Name-First: Jianhua Author-X-Name-Last: Guo Author-Name: Jianchang Hu Author-X-Name-First: Jianchang Author-X-Name-Last: Hu Author-Name: Bing-Yi Jing Author-X-Name-First: Bing-Yi Author-X-Name-Last: Jing Author-Name: Zhen Zhang Author-X-Name-First: Zhen Author-X-Name-Last: Zhang Title: Spline-Lasso in High-Dimensional Linear Regression Abstract: We consider a high-dimensional linear regression problem, where the covariates (features) are ordered in some meaningful way, and the number of covariates p can be much larger than the sample size n. The fused lasso of Tibshirani et al. is designed especially to tackle this type of problem; it yields sparse coefficients and selects grouped variables, and encourages a locally constant coefficient profile within each group. However, in some applications, the effects of different features within a group might be different and change smoothly. In this article, we propose a new spline-lasso or more generally, spline-MCP to better capture the different effects within the group. The newly proposed method is very easy to implement since it can be easily turned into a lasso or MCP problem. Simulations show that the method works very effectively both in feature selection and prediction accuracy. A real application is also given to illustrate the benefits of the method. 
Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 288-297 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2015.1005839 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1005839 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:288-297 Template-Type: ReDIF-Article 1.0 Author-Name: Wentao Li Author-X-Name-First: Wentao Author-X-Name-Last: Li Author-Name: Rong Chen Author-X-Name-First: Rong Author-X-Name-Last: Chen Author-Name: Zhiqiang Tan Author-X-Name-First: Zhiqiang Author-X-Name-Last: Tan Title: Efficient Sequential Monte Carlo With Multiple Proposals and Control Variates Abstract: Sequential Monte Carlo is a useful simulation-based method for online filtering of state-space models. For certain complex state-space models, a single proposal distribution is usually not satisfactory and using multiple proposal distributions is a general approach to address various aspects of the filtering problem. This article proposes an efficient method of using multiple proposals in combination with control variates. The likelihood approach of Tan (2004) is used in both resampling and estimation. The new algorithm is shown to be asymptotically more efficient than the direct use of multiple proposals and control variates. Guidance for selecting multiple proposals and control variates is also given. Numerical examples are used to demonstrate that the new algorithm can significantly improve over the bootstrap filter and auxiliary particle filter. Journal: Journal of the American Statistical Association Pages: 298-313 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2015.1006364 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1006364 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:298-313 Template-Type: ReDIF-Article 1.0 Author-Name: Chao Du Author-X-Name-First: Chao Author-X-Name-Last: Du Author-Name: Chu-Lan Michael Kao Author-X-Name-First: Chu-Lan Michael Author-X-Name-Last: Kao Author-Name: S. C. Kou Author-X-Name-First: S. C. Author-X-Name-Last: Kou Title: Stepwise Signal Extraction via Marginal Likelihood Abstract: This article studies the estimation of a stepwise signal. To determine the number and locations of change-points of the stepwise signal, we formulate a maximum marginal likelihood estimator, which can be computed with a quadratic cost using dynamic programming. We carry out an extensive investigation on the choice of the prior distribution and study the asymptotic properties of the maximum marginal likelihood estimator. We propose to treat each possible set of change-points equally and adopt an empirical Bayes approach to specify the prior distribution of segment parameters. A detailed simulation study is performed to compare the effectiveness of this method with other existing methods. We demonstrate our method on single-molecule enzyme reaction data and on DNA array comparative genomic hybridization (CGH) data. Our study shows that this method is applicable to a wide range of models and offers appealing results in practice. Supplementary materials for this article are available online. 
Journal: Journal of the American Statistical Association Pages: 314-330 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2015.1006365 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1006365 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:314-330 Template-Type: ReDIF-Article 1.0 Author-Name: Jacopo Mandozzi Author-X-Name-First: Jacopo Author-X-Name-Last: Mandozzi Author-Name: Peter Bühlmann Author-X-Name-First: Peter Author-X-Name-Last: Bühlmann Title: Hierarchical Testing in the High-Dimensional Setting With Correlated Variables Abstract: We propose a method for testing whether hierarchically ordered groups of potentially correlated variables are significant for explaining a response in a high-dimensional linear model. In the presence of highly correlated variables, as is very common in high-dimensional data, it seems indispensable to go beyond an approach of inferring individual regression coefficients, and we show that detecting the smallest groups of variables (MTDs: minimal true detections) is realistic. Thanks to the hierarchy among the groups of variables, powerful multiple testing adjustment is possible, which leads to a data-driven choice of the resolution level for the groups. Our procedure, based on repeated sample splitting, is shown to asymptotically control the familywise error rate and we provide empirical results for simulated and real data which complement the theoretical analysis. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 331-343 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2015.1007209 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1007209 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:331-343 Template-Type: ReDIF-Article 1.0 Author-Name: Ying Wei Author-X-Name-First: Ying Author-X-Name-Last: Wei Author-Name: Xiaoyu Song Author-X-Name-First: Xiaoyu Author-X-Name-Last: Song Author-Name: Mengling Liu Author-X-Name-First: Mengling Author-X-Name-Last: Liu Author-Name: Iuliana Ionita-Laza Author-X-Name-First: Iuliana Author-X-Name-Last: Ionita-Laza Author-Name: Joan Reibman Author-X-Name-First: Joan Author-X-Name-Last: Reibman Title: Quantile Regression in the Secondary Analysis of Case--Control Data Abstract: The case--control design is widely used in epidemiology and other fields to identify factors associated with a disease. Data collected from existing case--control studies can also provide a cost-effective way to investigate the association of risk factors with secondary outcomes. When the secondary outcome is a continuous random variable, most of the existing methods focus on the statistical inference on the mean of the secondary outcome. In this article, we propose a quantile-based approach to facilitating a comprehensive investigation of covariates’ effects on multiple quantiles of the secondary outcome. We construct a new family of estimating equations combining observed and pseudo outcomes, which lead to consistent estimation of conditional quantiles using case--control data. Simulations are conducted to evaluate the performance of our proposed approach, and a case--control study on genetic association with asthma is used to demonstrate the method. Supplementary materials for this article are available online. 
Journal: Journal of the American Statistical Association Pages: 344-354 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2015.1008101 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1008101 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:344-354 Template-Type: ReDIF-Article 1.0 Author-Name: Yuan Jiang Author-X-Name-First: Yuan Author-X-Name-Last: Jiang Author-Name: Yunxiao He Author-X-Name-First: Yunxiao Author-X-Name-Last: He Author-Name: Heping Zhang Author-X-Name-First: Heping Author-X-Name-Last: Zhang Title: Variable Selection With Prior Information for Generalized Linear Models via the Prior LASSO Method Abstract: LASSO is a popular statistical tool often used in conjunction with generalized linear models that can simultaneously select variables and estimate parameters. When there are many variables of interest, as in current biological and biomedical studies, the power of LASSO can be limited. Fortunately, vast amounts of biological and biomedical data have been collected, and they may contain useful information about the importance of certain variables. This article proposes an extension of LASSO, namely, prior LASSO (pLASSO), to incorporate that prior information into penalized generalized linear models. The goal is achieved by adding to the LASSO criterion function an additional measure of the discrepancy between the prior information and the model. For linear regression, the whole solution path of the pLASSO estimator can be found with a procedure similar to the least angle regression (LARS). Asymptotic theories and simulation results show that pLASSO provides significant improvement over LASSO when the prior information is relatively accurate. When the prior information is less reliable, pLASSO shows great robustness to misspecification. We illustrate the application of pLASSO using a real dataset from a genome-wide association study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 355-376 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2015.1008363 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1008363 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:355-376 Template-Type: ReDIF-Article 1.0 Author-Name: Faming Liang Author-X-Name-First: Faming Author-X-Name-Last: Liang Author-Name: Ick Hoon Jin Author-X-Name-First: Ick Hoon Author-X-Name-Last: Jin Author-Name: Qifan Song Author-X-Name-First: Qifan Author-X-Name-Last: Song Author-Name: Jun S. Liu Author-X-Name-First: Jun S. Author-X-Name-Last: Liu Title: An Adaptive Exchange Algorithm for Sampling From Distributions With Intractable Normalizing Constants Abstract: Sampling from the posterior distribution for a model whose normalizing constant is intractable is a long-standing problem in statistical research. We propose a new algorithm, adaptive auxiliary variable exchange algorithm, or, in short, adaptive exchange (AEX) algorithm, to tackle this problem. The new algorithm can be viewed as an MCMC extension of the exchange algorithm, which generates auxiliary variables via an importance sampling procedure from a Markov chain running in parallel. The convergence of the algorithm is established under mild conditions. 
Compared to the exchange algorithm, the new algorithm removes the requirement that the auxiliary variables must be drawn using a perfect sampler, and thus can be applied to many models for which a perfect sampler is unavailable or very expensive. Compared to approximate exchange algorithms, such as the double Metropolis-Hastings sampler, the new algorithm overcomes their theoretical difficulties with convergence. The new algorithm is tested on the spatial autologistic and autonormal models. The numerical results indicate that the new algorithm is particularly useful for problems in which the underlying system is strongly dependent. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 377-393 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2015.1009072 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1009072 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:377-393 Template-Type: ReDIF-Article 1.0 Author-Name: Mengjie Chen Author-X-Name-First: Mengjie Author-X-Name-Last: Chen Author-Name: Zhao Ren Author-X-Name-First: Zhao Author-X-Name-Last: Ren Author-Name: Hongyu Zhao Author-X-Name-First: Hongyu Author-X-Name-Last: Zhao Author-Name: Harrison Zhou Author-X-Name-First: Harrison Author-X-Name-Last: Zhou Title: Asymptotically Normal and Efficient Estimation of Covariate-Adjusted Gaussian Graphical Model Abstract: We propose an asymptotically normal and efficient procedure to estimate every finite subgraph for the covariate-adjusted Gaussian graphical model. As a consequence, a confidence interval, as well as a p-value, can be obtained for each edge. The procedure is tuning-free and enjoys easy implementation and efficient computation through parallel estimation on subgraphs or edges. We apply the asymptotic normality result to perform support recovery through edge-wise adaptive thresholding. This support recovery procedure is called ANTAC, standing for asymptotically normal estimation with thresholding after adjusting covariates. ANTAC outperforms other methodologies in the literature in a range of simulation studies. We apply ANTAC to identify gene--gene interactions using an eQTL dataset. Our result achieves better interpretability and accuracy in comparison with a state-of-the-art method. Supplementary materials for the article are available online. Journal: Journal of the American Statistical Association Pages: 394-406 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2015.1010039 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1010039 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:394-406 Template-Type: ReDIF-Article 1.0 Author-Name: Matthew Reimherr Author-X-Name-First: Matthew Author-X-Name-Last: Reimherr Author-Name: Dan Nicolae Author-X-Name-First: Dan Author-X-Name-Last: Nicolae Title: Estimating Variance Components in Functional Linear Models With Applications to Genetic Heritability Abstract: Quantifying heritability is the first step in understanding the contribution of genetic variation to the risk architecture of complex human diseases and traits. Heritability can be estimated for univariate phenotypes from nonfamily data using linear mixed effects models. There is, however, no fully developed methodology for defining or estimating heritability from longitudinal studies.
By examining longitudinal studies, researchers have the opportunity to better understand the genetic influence on the temporal development of diseases, which can be vital for populations with rapidly changing phenotypes such as children or the elderly. To define and estimate heritability for longitudinally measured phenotypes, we present a framework based on functional data analysis, FDA. While our procedures have important genetic consequences, they also represent a substantial development for FDA. In particular, we present a very general methodology for constructing optimal, unbiased estimates of variance components in functional linear models. Such a problem is challenging as likelihoods and densities do not readily generalize to infinite-dimensional settings. Our procedure can be viewed as a functional generalization of the minimum norm quadratic unbiased estimation procedure, MINQUE, presented by C. R. Rao, and is equivalent to residual maximum likelihood, REML, in univariate settings. We apply our methodology to the Childhood Asthma Management Program, CAMP, a 4-year longitudinal study examining the long-term effects of daily asthma medications on children. Journal: Journal of the American Statistical Association Pages: 407-422 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2015.1016224 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1016224 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:407-422 Template-Type: ReDIF-Article 1.0 Author-Name: Zeng-Hua Lu Author-X-Name-First: Zeng-Hua Author-X-Name-Last: Lu Title: Extended MaxT Tests of One-Sided Hypotheses Abstract: In many statistical applications of one-sided tests of multiple hypotheses, researchers are often concerned not only with global tests of the intersection of individual hypotheses, but also with multiple tests of individual hypotheses. For example, in clinical trial studies, researchers often need to find out the efficacy of a treatment, as well as the significance of each outcome measurement (endpoint) of the treatment. This article proposes MaxT-type tests aimed at improving the global power of existing MaxT tests. Our extended MaxT tests are constructed by adding an extra component to the maximand set of existing MaxT tests. The added component is a weighted sum of the other components. Some power properties relating to the choice of weights are studied. Our simulation study shows that the proposed tests can considerably improve the global power of existing MaxT tests and can also outperform many other global tests under some alternatives and/or some nonnormal distributions. Furthermore, it is shown that such global power improvement may involve little loss of power on multiple testing. Two real data examples on clinical trial studies reported in the literature are reexamined. The results of our tests suggest stronger evidence of treatment effects than MaxT tests and likelihood ratio tests, while changing little in the evidence concerning endpoint testing. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 423-437 Issue: 513 Volume: 111 Year: 2016 Month: 3 X-DOI: 10.1080/01621459.2015.1019509 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1019509 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:423-437 Template-Type: ReDIF-Article 1.0 Author-Name: Daniele Durante Author-X-Name-First: Daniele Author-X-Name-Last: Durante Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Author-Name: Joshua T. Vogelstein Author-X-Name-First: Joshua T. Author-X-Name-Last: Vogelstein Title: Nonparametric Bayes Modeling of Populations of Networks Abstract: Replicated network data are increasingly available in many research fields. For example, in connectomic applications, interconnections among brain regions are collected for each patient under study, motivating statistical models that can flexibly characterize the probabilistic generative mechanism underlying these network-valued data. Available models for a single network are not designed specifically for inference on the entire probability mass function of a network-valued random variable and therefore lack flexibility in characterizing the distribution of relevant topological structures. We propose a flexible Bayesian nonparametric approach for modeling the population distribution of network-valued data. The joint distribution of the edges is defined via a mixture model that reduces dimensionality and efficiently incorporates network information within each mixture component by leveraging latent space representations. The formulation leads to an efficient Gibbs sampler and provides simple and coherent strategies for inference and goodness-of-fit assessments. We provide theoretical results on the flexibility of our model and illustrate improved performance—compared to state-of-the-art models—in simulations and application to human brain networks. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1516-1530 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2016.1219260 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1219260 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1516-1530 Template-Type: ReDIF-Article 1.0 Author-Name: Xinyu Zhang Author-X-Name-First: Xinyu Author-X-Name-Last: Zhang Author-Name: Haiying Wang Author-X-Name-First: Haiying Author-X-Name-Last: Wang Author-Name: Yanyuan Ma Author-X-Name-First: Yanyuan Author-X-Name-Last: Ma Author-Name: Raymond J. Carroll Author-X-Name-First: Raymond J. Author-X-Name-Last: Carroll Title: Linear Model Selection When Covariates Contain Errors Abstract: Prediction precision is arguably the most relevant criterion of a model in practice and is often a sought-after property. A common difficulty with covariates measured with errors is the impossibility of performing prediction evaluation on the data even when a model is completely specified without any unknown parameters. We bypass this inherent difficulty by using special properties of moment relations in linear regression models with measurement errors. The end product is a model selection procedure that achieves the same optimality properties that are achieved in classical linear regression models without covariate measurement error. Asymptotically, the procedure selects the model with the minimum prediction error in general, and selects the smallest correct model if the regression relation is indeed linear.
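As a small illustration of the latent space representations mentioned in the network-modeling abstract above, the Python sketch below simulates one binary network whose edge probabilities are a logistic function of inner products of latent coordinates; drawing the coordinates from one of several mixture components would then generate a population of networks. This is a generic latent space sketch under assumed interfaces, not the authors' model.

    import numpy as np

    def simulate_network(Z, alpha, rng):
        # Z: (V, d) latent coordinates; edge i-j occurs with probability
        # logistic(alpha + <Z_i, Z_j>), independently over node pairs.
        V = Z.shape[0]
        logits = alpha + Z @ Z.T
        prob = 1.0 / (1.0 + np.exp(-logits))
        upper = np.triu(rng.uniform(size=(V, V)) < prob, k=1)
        return (upper | upper.T).astype(int)   # symmetric adjacency matrix

    rng = np.random.default_rng(1)
    Z = rng.normal(size=(20, 2))
    A = simulate_network(Z, alpha=-1.0, rng=rng)   # one 20-node network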
Our model selection procedure is useful in prediction when future covariates without measurement error become available, for example, due to improved technology or better management and design of data collection procedures. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1553-1561 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2016.1219262 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1219262 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1553-1561 Template-Type: ReDIF-Article 1.0 Author-Name: Wayne A. Fuller Author-X-Name-First: Wayne A. Author-X-Name-Last: Fuller Author-Name: Jason C. Legg Author-X-Name-First: Jason C. Author-X-Name-Last: Legg Author-Name: Yang Li Author-X-Name-First: Yang Author-X-Name-Last: Li Title: Bootstrap Variance Estimation for Rejective Sampling Abstract: Replication procedures have proven useful for variance estimation for large-scale complex surveys. As an extension of bootstrap procedures to rejective samples, we define a bootstrap sample that is a rejective, unequal probability, replacement sample selected from the original sample. A modification of the bootstrap with improved performance is suggested for stratified samples with small stratum sizes. Simulations for Poisson and stratified rejective samples support the use of replicates in estimating the variance of the regression estimator for rejective samples. Journal: Journal of the American Statistical Association Pages: 1562-1570 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2016.1222285 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1222285 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1562-1570 Template-Type: ReDIF-Article 1.0 Author-Name: Gongjun Xu Author-X-Name-First: Gongjun Author-X-Name-Last: Xu Author-Name: Tony Sit Author-X-Name-First: Tony Author-X-Name-Last: Sit Author-Name: Lan Wang Author-X-Name-First: Lan Author-X-Name-Last: Wang Author-Name: Chiung-Yu Huang Author-X-Name-First: Chiung-Yu Author-X-Name-Last: Huang Title: Estimation and Inference of Quantile Regression for Survival Data Under Biased Sampling Abstract: Biased sampling occurs frequently in economics, epidemiology, and medical studies, either by design or due to the data collection mechanism. Failing to take into account the sampling bias usually leads to incorrect inference. We propose a unified estimation procedure and a computationally fast resampling method to make statistical inference for quantile regression with survival data under general biased sampling schemes, including but not limited to length-biased sampling, the case-cohort design, and variants thereof. We establish the uniform consistency and weak convergence of the proposed estimator as a process of the quantile level. We also investigate more efficient estimation using the generalized method of moments and derive the asymptotic normality. We further propose a new resampling method for inference, which differs from alternative procedures in that it does not require repeatedly solving estimating equations. It is proved that the resampling method consistently estimates the asymptotic covariance matrix.
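The "no re-solving of estimating equations" idea in the resampling discussion above can be caricatured with a generic multiplier bootstrap on estimated influence functions; this sketch is our own illustration of that general principle, not the authors' procedure, and the influence-function input is an assumed ingredient.

    import numpy as np

    def multiplier_bootstrap_se(infl, n_draws, rng):
        # infl: (n, p) estimated influence-function values (row mean ~ 0).
        # Each replicate perturbs the estimator by a weighted average of the
        # influence values, so no estimating equation is ever re-solved.
        n = infl.shape[0]
        reps = np.empty((n_draws, infl.shape[1]))
        for b in range(n_draws):
            w = rng.exponential(size=n)                 # iid mean-one multipliers
            reps[b] = ((w - 1.0)[:, None] * infl).mean(axis=0)
        return reps.std(axis=0)                         # bootstrap standard errors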
The unified framework proposed in this article provides researchers and practitioners with a convenient tool for analyzing data collected from various designs. Simulation studies and applications to real datasets are presented for illustration. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1571-1586 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2016.1222286 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1222286 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1571-1586 Template-Type: ReDIF-Article 1.0 Author-Name: Kyle R. White Author-X-Name-First: Kyle R. Author-X-Name-Last: White Author-Name: Leonard A. Stefanski Author-X-Name-First: Leonard A. Author-X-Name-Last: Stefanski Author-Name: Yichao Wu Author-X-Name-First: Yichao Author-X-Name-Last: Wu Title: Variable Selection in Kernel Regression Using Measurement Error Selection Likelihoods Abstract: This article develops a nonparametric shrinkage and selection estimator via the measurement error selection likelihood approach recently proposed by Stefanski, Wu, and White. The measurement error kernel regression operator (MEKRO) has the same form as the Nadaraya–Watson kernel estimator, but optimizes a measurement error model selection likelihood to estimate the kernel bandwidths. Much like LASSO or COSSO solution paths, MEKRO results in solution paths depending on a tuning parameter that controls shrinkage and selection via a bound on the harmonic mean of the pseudo-measurement error standard deviations. We use small-sample-corrected AIC to select the tuning parameter. Large-sample properties of MEKRO are studied and small-sample properties are explored via Monte Carlo experiments and applications to data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1587-1597 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2016.1222287 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1222287 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1587-1597 Template-Type: ReDIF-Article 1.0 Author-Name: Michalis K. Titsias Author-X-Name-First: Michalis K. Author-X-Name-Last: Titsias Author-Name: Christopher Yau Author-X-Name-First: Christopher Author-X-Name-Last: Yau Title: The Hamming Ball Sampler Abstract: We introduce the Hamming ball sampler, a novel Markov chain Monte Carlo algorithm, for efficient inference in statistical models involving high-dimensional discrete state spaces. The sampling scheme uses an auxiliary variable construction that adaptively truncates the model space, allowing iterative exploration of the full model space. The approach generalizes conventional Gibbs sampling schemes for discrete spaces and provides an intuitive means for user-controlled balance between statistical efficiency and computational tractability. We illustrate the generic utility of our sampling algorithm through application to a range of statistical models. Supplementary materials for this article are available online.
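The auxiliary-variable construction behind the Hamming ball sampler can be sketched in Python for a binary state vector: draw an auxiliary point uniformly from the Hamming ball around the current state, then resample the state from the target restricted to the ball around the auxiliary point. This is a schematic radius-r version written from the abstract's description, with hypothetical interfaces, and assumes x is a 0/1 integer array.

    import numpy as np
    from itertools import combinations

    def ball(x, r):
        # All binary vectors within Hamming distance r of x (x itself included).
        out = [x.copy()]
        for k in range(1, r + 1):
            for idx in combinations(range(len(x)), k):
                y = x.copy()
                y[list(idx)] ^= 1
                out.append(y)
        return out

    def hamming_ball_step(x, log_target, r, rng):
        # Auxiliary variable: uniform over the Hamming ball around the state.
        candidates = ball(x, r)
        u = candidates[rng.integers(len(candidates))]
        # Gibbs-like move: sample from the target restricted to the ball at u.
        cands = ball(u, r)
        logw = np.array([log_target(c) for c in cands])
        w = np.exp(logw - logw.max())
        return cands[rng.choice(len(cands), p=w / w.sum())]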
Journal: Journal of the American Statistical Association Pages: 1598-1611 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2016.1222288 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1222288 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1598-1611 Template-Type: ReDIF-Article 1.0 Author-Name: Xu He Author-X-Name-First: Xu Author-X-Name-Last: He Title: Rotated Sphere Packing Designs Abstract: We propose a new class of space-filling designs called rotated sphere packing designs for computer experiments. The approach starts from the asymptotically optimal positioning of identical balls that covers the unit cube. Properly scaled, rotated, translated, and extracted, such designs are excellent under the maximin distance criterion, low in discrepancy, good in projective uniformity, and thus useful for both prediction and numerical integration. We provide a fast algorithm to construct such designs for any number of dimensions and points, with R code available online. Theoretical and numerical results are also provided. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1612-1622 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2016.1222289 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1222289 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1612-1622 Template-Type: ReDIF-Article 1.0 Author-Name: Mingyuan Zhou Author-X-Name-First: Mingyuan Author-X-Name-Last: Zhou Author-Name: Stefano Favaro Author-X-Name-First: Stefano Author-X-Name-Last: Favaro Author-Name: Stephen G Walker Author-X-Name-First: Stephen G Author-X-Name-Last: Walker Title: Frequency of Frequencies Distributions and Size-Dependent Exchangeable Random Partitions Abstract: Motivated by the fundamental problem of modeling the frequency of frequencies (FoF) distribution, this article introduces the concept of a cluster structure to define a probability function that governs the joint distribution of a random count and its exchangeable random partitions. A cluster structure, naturally arising from a completely random measure mixed Poisson process, allows the probability distribution of the random partitions of a subset of a population to be dependent on the population size, a distinct and motivated feature that makes it more flexible than a partition structure. This allows it to model an entire FoF distribution whose structural properties change as the population size varies. An FoF vector can be simulated by drawing an infinite number of Poisson random variables, or by a stick-breaking construction with a finite random number of steps. A generalized negative binomial process model is proposed to generate a cluster structure, where in the prior the number of clusters is finite and Poisson distributed, and the cluster sizes follow a truncated negative binomial distribution. We propose a simple Gibbs sampling algorithm to extrapolate the FoF vector of a population given the FoF vector of a sample taken without replacement from the population. We illustrate our results and demonstrate the advantages of the proposed models through the analysis of real text, genomic, and survey data. Supplementary materials for this article are available online.
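For readers unfamiliar with the frequency of frequencies object modeled above, the short Python sketch below computes an FoF vector from raw item labels: entry k counts the distinct items observed exactly k times. The function name is ours, for illustration only.

    import numpy as np
    from collections import Counter

    def fof_vector(labels):
        # item -> number of occurrences, then count -> number of such items
        item_counts = Counter(labels)
        fof = Counter(item_counts.values())
        kmax = max(fof)
        return np.array([fof.get(k, 0) for k in range(1, kmax + 1)])

    # "a" appears 3 times, "b" and "c" once each: FoF vector is [2, 0, 1].
    print(fof_vector(["a", "a", "a", "b", "c"]))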
Journal: Journal of the American Statistical Association Pages: 1623-1635 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2016.1222290 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1222290 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1623-1635 Template-Type: ReDIF-Article 1.0 Author-Name: Pieralberto Guarniero Author-X-Name-First: Pieralberto Author-X-Name-Last: Guarniero Author-Name: Adam M. Johansen Author-X-Name-First: Adam M. Author-X-Name-Last: Johansen Author-Name: Anthony Lee Author-X-Name-First: Anthony Author-X-Name-Last: Lee Title: The Iterated Auxiliary Particle Filter Abstract: We present an offline, iterated particle filter to facilitate statistical inference in general state space hidden Markov models. Given a model and a sequence of observations, the associated marginal likelihood L is central to likelihood-based inference for unknown statistical parameters. We define a class of “twisted” models: each member is specified by a sequence of positive functions ψ and has an associated ψ-auxiliary particle filter that provides unbiased estimates of L. We identify a sequence ψ* that is optimal in the sense that the ψ*-auxiliary particle filter’s estimate of L has zero variance. In practical applications, ψ* is unknown so the ψ*-auxiliary particle filter cannot straightforwardly be implemented. We use an iterative scheme to approximate ψ* and demonstrate empirically that the resulting iterated auxiliary particle filter significantly outperforms the bootstrap particle filter in challenging settings. Applications include parameter estimation using a particle Markov chain Monte Carlo algorithm. Journal: Journal of the American Statistical Association Pages: 1636-1647 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2016.1222291 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1222291 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1636-1647 Template-Type: ReDIF-Article 1.0 Author-Name: Shujie Ma Author-X-Name-First: Shujie Author-X-Name-Last: Ma Author-Name: Yanyuan Ma Author-X-Name-First: Yanyuan Author-X-Name-Last: Ma Author-Name: Yanqing Wang Author-X-Name-First: Yanqing Author-X-Name-Last: Wang Author-Name: Eli S. Kravitz Author-X-Name-First: Eli S. Author-X-Name-Last: Kravitz Author-Name: Raymond J. Carroll Author-X-Name-First: Raymond J. Author-X-Name-Last: Carroll Title: A Semiparametric Single-Index Risk Score Across Populations Abstract: We consider a problem motivated by issues in nutritional epidemiology, across diseases and populations. In this area, it is becoming increasingly common for diseases to be modeled by a single diet score, such as the Healthy Eating Index, the Mediterranean Diet Score, etc. For each disease and for each population, a partially linear single-index model is fit. The partially linear aspect of the problem is allowed to differ in each population and disease. However, and crucially, the single-index itself, having to do with the diet score, is common to all diseases and populations, and the nonparametrically estimated functions of the single-index are the same up to a scale parameter. Using B-splines with an increasing number of knots, we develop a method to solve the problem, and display its asymptotic theory.
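Since the iterated auxiliary particle filter above is benchmarked against the bootstrap particle filter, a minimal Python sketch of the latter's marginal likelihood estimate may help; init, transition, and log_obs are hypothetical model callbacks, and the iAPF would replace these blind proposals with ψ-twisted ones.

    import numpy as np

    def bootstrap_pf_loglik(y, n, init, transition, log_obs, rng):
        # Log of the standard particle estimate of the marginal likelihood L.
        x = init(n, rng)                       # particles at time 0
        loglik = 0.0
        for t, yt in enumerate(y):
            if t > 0:
                x = transition(x, rng)         # propagate through the state model
            logw = log_obs(yt, x)              # observation log-density per particle
            m = logw.max()
            w = np.exp(logw - m)
            loglik += m + np.log(w.mean())     # running likelihood estimate
            idx = rng.choice(n, size=n, p=w / w.sum())
            x = x[idx]                         # multinomial resampling
        return loglik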
An application to the NIH-AARP Study of Diet and Health is described, where we show the advantages of using multiple diseases and populations simultaneously rather than one at a time in understanding the effect of increased milk consumption. Simulations illustrate the properties of the methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1648-1662 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2016.1222944 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1222944 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1648-1662 Template-Type: ReDIF-Article 1.0 Author-Name: Susanne M. Schennach Author-X-Name-First: Susanne M. Author-X-Name-Last: Schennach Author-Name: Daniel Wilhelm Author-X-Name-First: Daniel Author-X-Name-Last: Wilhelm Title: A Simple Parametric Model Selection Test Abstract: We propose a simple model selection test for choosing between two parametric likelihoods, which can be applied in the most general setting without any assumptions on the relation between the candidate models and the true distribution. That is, both, one, or neither of the candidate models may be correctly specified or misspecified; they may be nested, nonnested, strictly nonnested, or overlapping. Unlike in previous testing approaches, no pretesting is needed, since in each case the same test statistic, together with a standard normal critical value, can be used. The new procedure controls asymptotic size uniformly over a large class of data-generating processes. We demonstrate its finite sample properties in a Monte Carlo experiment and its practical relevance in an empirical application comparing Keynesian versus new classical macroeconomic models. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1663-1674 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2016.1224716 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1224716 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1663-1674 Template-Type: ReDIF-Article 1.0 Author-Name: Yong-Dao Zhou Author-X-Name-First: Yong-Dao Author-X-Name-Last: Zhou Author-Name: Hongquan Xu Author-X-Name-First: Hongquan Author-X-Name-Last: Xu Title: Composite Designs Based on Orthogonal Arrays and Definitive Screening Designs Abstract: Central composite designs are widely used in practice for factor screening and building response surface models. We study two classes of new composite designs. The first class consists of a two-level factorial design and a three-level orthogonal array; the second consists of a two-level factorial design and a three-level definitive screening design. We derive bounds on their efficiencies for estimating all or part of the parameters in a second-order model and obtain some general theoretical results. New composite designs are constructed. They are more efficient than central composite designs and other existing designs. Supplementary materials are available online. Journal: Journal of the American Statistical Association Pages: 1675-1683 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2016.1228535 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1228535 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1675-1683 Template-Type: ReDIF-Article 1.0 Author-Name: Yen-Chi Chen Author-X-Name-First: Yen-Chi Author-X-Name-Last: Chen Author-Name: Christopher R. Genovese Author-X-Name-First: Christopher R. Author-X-Name-Last: Genovese Author-Name: Larry Wasserman Author-X-Name-First: Larry Author-X-Name-Last: Wasserman Title: Density Level Sets: Asymptotics, Inference, and Visualization Abstract: We study the plug-in estimator for density level sets under Hausdorff loss. We derive asymptotic theory for this estimator, and based on this theory, we develop two bootstrap confidence regions for level sets. We introduce a new technique for visualizing density level sets, even in multiple dimensions, that is easy to interpret and efficient to compute. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1684-1696 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2016.1228536 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1228536 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1684-1696 Template-Type: ReDIF-Article 1.0 Author-Name: Shizhe Chen Author-X-Name-First: Shizhe Author-X-Name-Last: Chen Author-Name: Ali Shojaie Author-X-Name-First: Ali Author-X-Name-Last: Shojaie Author-Name: Daniela M. Witten Author-X-Name-First: Daniela M. Author-X-Name-Last: Witten Title: Network Reconstruction From High-Dimensional Ordinary Differential Equations Abstract: We consider the task of learning a dynamical system from high-dimensional time-course data. For instance, we might wish to estimate a gene regulatory network from gene expression data measured at discrete time points. We model the dynamical system nonparametrically as a system of additive ordinary differential equations. Most existing methods for parameter estimation in ordinary differential equations estimate the derivatives from noisy observations. This is known to be challenging and inefficient. We propose a novel approach that does not involve derivative estimation. We show that the proposed method can consistently recover the true network structure even in high dimensions, and we demonstrate empirical improvement over competing approaches. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1697-1707 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2016.1229197 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1229197 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1697-1707 Template-Type: ReDIF-Article 1.0 Author-Name: Daniel Manrique-Vallier Author-X-Name-First: Daniel Author-X-Name-Last: Manrique-Vallier Author-Name: Jerome P. Reiter Author-X-Name-First: Jerome P. Author-X-Name-Last: Reiter Title: Bayesian Simultaneous Edit and Imputation for Multivariate Categorical Data Abstract: In categorical data, it is typically the case that some combinations of variables are theoretically impossible, such as a 3-year-old child who is married or a man who is pregnant; cells corresponding to such combinations are known as structural zeros. In practice, however, reported values often include such impossible combinations due to, for example, respondent mistakes or data processing errors. To purge data of such errors, many statistical organizations use a process known as edit-imputation.
The basic idea is first to select reported values to change according to some heuristic or loss function, and second to replace those values with plausible imputations. This two-stage process typically does not fully use information in the data when determining locations of errors, nor does it appropriately reflect uncertainty resulting from the edits and imputations. We present an alternative approach to editing and imputation for categorical microdata with structural zeros that addresses these shortcomings. Specifically, we use a Bayesian hierarchical model that couples a stochastic model for the measurement error process with a Dirichlet process mixture of multinomial distributions for the underlying, error-free values. The latter model is restricted to have support only on the set of theoretically possible combinations. We illustrate this integrated approach to editing and imputation using simulation studies with data from the 2000 U.S. census, and compare it to a two-stage edit-imputation routine. Supplementary material is available online. Journal: Journal of the American Statistical Association Pages: 1708-1719 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2016.1231612 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1231612 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1708-1719 Template-Type: ReDIF-Article 1.0 Author-Name: Taisuke Otsu Author-X-Name-First: Taisuke Author-X-Name-Last: Otsu Author-Name: Yoshiyasu Rai Author-X-Name-First: Yoshiyasu Author-X-Name-Last: Rai Title: Bootstrap Inference of Matching Estimators for Average Treatment Effects Abstract: It is known that the naive bootstrap is not asymptotically valid for a matching estimator of the average treatment effect with a fixed number of matches. In this article, we propose asymptotically valid inference methods for matching estimators based on the weighted bootstrap. The key is to construct bootstrap counterparts by resampling based on certain linear forms of the estimators. Our weighted bootstrap is applicable to the matching estimators of both the average treatment effect and its counterpart for the treated population. Also, by incorporating the bias correction method of Abadie and Imbens (2011), our method can be asymptotically valid even for matching based on a vector of covariates. A simulation study indicates that the weighted bootstrap method compares favorably with the asymptotic normal approximation. As an empirical illustration, we apply the proposed method to the National Supported Work data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1720-1732 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2016.1231613 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1231613 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1720-1732 Template-Type: ReDIF-Article 1.0 Author-Name: Karthik Bharath Author-X-Name-First: Karthik Author-X-Name-Last: Bharath Author-Name: Prabhanjan Kambadur Author-X-Name-First: Prabhanjan Author-X-Name-Last: Kambadur Author-Name: Dipak K. Dey Author-X-Name-First: Dipak K.
Author-X-Name-Last: Dey Author-Name: Arvind Rao Author-X-Name-First: Arvind Author-X-Name-Last: Rao Author-Name: Veerabhadran Baladandayuthapani Author-X-Name-First: Veerabhadran Author-X-Name-Last: Baladandayuthapani Title: Statistical Tests for Large Tree-Structured Data Abstract: We develop a general statistical framework for the analysis and inference of large tree-structured data, with a focus on developing asymptotic goodness-of-fit tests. We first propose a consistent statistical model for binary trees, from which we develop a class of invariant tests. Using the model for binary trees, we then construct tests for general trees by using the distributional properties of the continuum random tree, which arises as the invariant limit for a broad class of models for tree-structured data based on conditioned Galton–Watson processes. The test statistics for the goodness-of-fit tests are simple to compute and are asymptotically distributed as χ² and F random variables. We illustrate our methods on an important application of detecting tumor heterogeneity in brain cancer. We use a novel approach with tree-based representations of magnetic resonance images and employ the developed tests to ascertain tumor heterogeneity between two groups of patients. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1733-1743 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2016.1240081 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1240081 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1733-1743 Template-Type: ReDIF-Article 1.0 Author-Name: Yacine Aït-Sahalia Author-X-Name-First: Yacine Author-X-Name-Last: Aït-Sahalia Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Roger J. A. Laeven Author-X-Name-First: Roger J. A. Author-X-Name-Last: Laeven Author-Name: Christina Dan Wang Author-X-Name-First: Christina Dan Author-X-Name-Last: Wang Author-Name: Xiye Yang Author-X-Name-First: Xiye Author-X-Name-Last: Yang Title: Estimation of the Continuous and Discontinuous Leverage Effects Abstract: This article examines the leverage effect, or the generally negative covariation between asset returns and their changes in volatility, under a general setup that allows the log-price and volatility processes to be Itô semimartingales. We decompose the leverage effect into continuous and discontinuous parts and develop statistical methods to estimate them. We establish the asymptotic properties of these estimators. We also extend our methods and results (for the continuous leverage) to the situation where there is market microstructure noise in the observed returns. We show in Monte Carlo simulations that our estimators have good finite sample performance. When applying our methods to real data, our empirical results provide convincing evidence of the presence of the two leverage effects, especially the discontinuous one. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1744-1758 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2016.1240082 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1240082 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1744-1758 Template-Type: ReDIF-Article 1.0 Author-Name: Wei Liu Author-X-Name-First: Wei Author-X-Name-Last: Liu Author-Name: Zhiwei Zhang Author-X-Name-First: Zhiwei Author-X-Name-Last: Zhang Author-Name: Lei Nie Author-X-Name-First: Lei Author-X-Name-Last: Nie Author-Name: Guoxing Soon Author-X-Name-First: Guoxing Author-X-Name-Last: Soon Title: A Case Study in Personalized Medicine: Rilpivirine Versus Efavirenz for Treatment-Naive HIV Patients Abstract: Rilpivirine and efavirenz are two major nonnucleoside reverse transcriptase inhibitors currently available in the U.S. for treatment-naive adult patients infected with human immunodeficiency virus (HIV). Two randomized clinical trials comparing the two drugs suggested that their relative efficacy may depend on baseline viral load and CD4 cell count. This article is concerned with the potential utility of these biomarkers in developing individualized treatment regimes that attempt to maximize the virologic response rate or the median of a composite outcome that combines virologic response with change in CD4 cell count (dCD4). Working with the median composite outcome removes the need to assign numerical values to the composite outcome, as would be necessary if we were to maximize its mean, and reduces the influence of extreme dCD4 values. To estimate the target quantities for a given treatment regime, we use G-computation, inverse probability weighting (IPW), and augmented IPW methods to deal with censoring and missing data under a monotone coarsening framework. The resulting estimates form the basis for optimization in a class of candidate regimes indexed by a small number of parameters. A cross-validation procedure is used to remove the resubstitution bias in evaluating an optimized treatment regime. Application of these methods to the HIV trial data yields candidate regimes of different forms together with cross-validated performance measure estimates, which suggest that optimized treatment regimes may be able to improve virologic response (but not the composite outcome) over uniform regimes that prescribe one drug for all patients. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1381-1392 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1280404 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1280404 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1381-1392 Template-Type: ReDIF-Article 1.0 Author-Name: Chuan Hong Author-X-Name-First: Chuan Author-X-Name-Last: Hong Author-Name: Yang Ning Author-X-Name-First: Yang Author-X-Name-Last: Ning Author-Name: Shuang Wang Author-X-Name-First: Shuang Author-X-Name-Last: Wang Author-Name: Hao Wu Author-X-Name-First: Hao Author-X-Name-Last: Wu Author-Name: Raymond J. Carroll Author-X-Name-First: Raymond J.
Author-X-Name-Last: Carroll Author-Name: Yong Chen Author-X-Name-First: Yong Author-X-Name-Last: Chen Title: PLEMT: A Novel Pseudolikelihood-Based EM Test for Homogeneity in Generalized Exponential Tilt Mixture Models Abstract: Motivated by analyses of DNA methylation data, we propose a semiparametric mixture model, namely, the generalized exponential tilt mixture model, to account for heterogeneity between differentially methylated and nondifferentially methylated subjects in the cancer group, and capture the differences in higher order moments (e.g., mean and variance) between subjects in cancer and normal groups. A pairwise pseudolikelihood is constructed to eliminate the unknown nuisance function. To circumvent boundary and nonidentifiability problems as in parametric mixture models, we modify the pseudolikelihood by adding a penalty function. We propose a pseudolikelihood-based expectation–maximization test and show that the proposed test follows a simple chi-squared limiting distribution. In addition, this simple asymptotic distribution gives the test computational advantages over permutation-based tests for high-dimensional genetic or epigenetic data. Simulation studies show that the proposed test controls Type I errors well and has better power than several existing tests. In particular, the proposed test outperforms the commonly used tests under all simulation settings considered, especially when there are variance differences between the two groups. The proposed test is applied to a real dataset to identify differentially methylated sites between ovarian cancer subjects and normal subjects. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1393-1404 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1280405 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1280405 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1393-1404 Template-Type: ReDIF-Article 1.0 Author-Name: Robert T. Krafty Author-X-Name-First: Robert T. Author-X-Name-Last: Krafty Author-Name: Ori Rosen Author-X-Name-First: Ori Author-X-Name-Last: Rosen Author-Name: David S. Stoffer Author-X-Name-First: David S. Author-X-Name-Last: Stoffer Author-Name: Daniel J. Buysse Author-X-Name-First: Daniel J. Author-X-Name-Last: Buysse Author-Name: Martica H. Hall Author-X-Name-First: Martica H. Author-X-Name-Last: Hall Title: Conditional Spectral Analysis of Replicated Multiple Time Series With Application to Nocturnal Physiology Abstract: This article considers the problem of analyzing associations between power spectra of multiple time series and cross-sectional outcomes when data are observed from multiple subjects. The motivating application comes from sleep medicine, where researchers are able to noninvasively record physiological time series signals during sleep. The frequency patterns of these signals, which can be quantified through the power spectrum, contain interpretable information about biological processes. An important problem in sleep research is drawing connections between power spectra of time series signals and clinical characteristics; these connections are key to understanding biological pathways through which sleep affects, and can be treated to improve, health.
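Since the power spectrum is the central quantity in the sleep application just described, here is a minimal Python sketch of its rawest estimate, the periodogram of a single series; this is textbook background only, not the article's multivariate Cholesky-based model.

    import numpy as np

    def periodogram(x, fs=1.0):
        # Raw periodogram: squared modulus of the DFT of the demeaned series,
        # the basic ingredient behind Whittle-likelihood spectral models.
        n = len(x)
        X = np.fft.rfft(x - np.mean(x))
        freqs = np.fft.rfftfreq(n, d=1.0 / fs)
        return freqs, (np.abs(X) ** 2) / (fs * n)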
Such analyses are challenging, as the power spectrum of multiple time series is a complex positive-definite matrix-valued function. This article proposes a new approach to such analyses based on a tensor-product spline model of Cholesky components of outcome-dependent power spectra. The approach flexibly models power spectra as nonparametric functions of frequency and outcome while preserving geometric constraints. The model is formulated in a fully Bayesian framework, and a Whittle likelihood-based Markov chain Monte Carlo (MCMC) algorithm is developed for automated model fitting and for conducting inference on associations between outcomes and spectral measures. The method is used to analyze data from a study of sleep in older adults and uncovers new insights into how stress and arousal are connected to the amount of time one spends in bed. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1405-1416 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1281811 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1281811 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1405-1416 Template-Type: ReDIF-Article 1.0 Author-Name: Xiaodong Li Author-X-Name-First: Xiaodong Author-X-Name-Last: Li Author-Name: Xu He Author-X-Name-First: Xu Author-X-Name-Last: He Author-Name: Yuanzhen He Author-X-Name-First: Yuanzhen Author-X-Name-Last: He Author-Name: Hui Zhang Author-X-Name-First: Hui Author-X-Name-Last: Zhang Author-Name: Zhong Zhang Author-X-Name-First: Zhong Author-X-Name-Last: Zhang Author-Name: Dennis K. J. Lin Author-X-Name-First: Dennis K. J. Author-X-Name-Last: Lin Title: The Design and Analysis for the Icing Wind Tunnel Experiment of a New Deicing Coating Abstract: A new kind of deicing coating has been developed to provide aircraft with efficient and durable protection from icing-induced dangers. The icing wind tunnel experiment is indispensable for confirming the usefulness of a deicing coating. Due to the high cost of each batch relative to the available budget, an efficient design of the icing wind tunnel experiment is crucial. The challenges in designing this experiment are manifold: it involves between-block factors and within-block factors, incomplete blocking with random effects, related factors, hard-to-change factors, and nuisance factors. Traditional designs and theories cannot be directly applied. To overcome these challenges, we propose a step-by-step design strategy that includes applying a cross array structure for between-block factors and within-block factors, a group of balanced conditions for optimizing incomplete blocking, a run order method to achieve the minimum number of level changes for hard-to-change factors, and a zero aliased matrix for the nuisance factors. New (theoretical) results for D-optimal design of incomplete blocking experiments with random block effects and a minimum number of level changes are obtained. Results of the experiments show that this novel deicing coating is promising in offering both high efficiency of ice reduction and a long service lifetime. The methodology proposed here is generalizable to other applications that involve nonstandard design problems. Supplementary materials for this article are available online.
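As a toy illustration of the run-order idea for hard-to-change factors mentioned in the deicing abstract above, the sketch below simply groups runs on those columns, which keeps the number of level changes small; the paper's actual method is an optimal construction, and this naive grouping is only for intuition. Both function names are ours.

    import numpy as np

    def group_hard_to_change(design, hard_cols):
        # Reorder runs so the hard-to-change columns change level rarely:
        # lexicographic sorting keeps equal levels adjacent.
        keys = tuple(design[:, c] for c in reversed(hard_cols))
        return design[np.lexsort(keys)]

    def level_changes(design, col):
        # Number of times factor `col` switches level across consecutive runs.
        return int(np.sum(design[1:, col] != design[:-1, col]))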
Journal: Journal of the American Statistical Association Pages: 1417-1429 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1281812 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1281812 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1417-1429 Template-Type: ReDIF-Article 1.0 Author-Name: Boyu Ren Author-X-Name-First: Boyu Author-X-Name-Last: Ren Author-Name: Sergio Bacallado Author-X-Name-First: Sergio Author-X-Name-Last: Bacallado Author-Name: Stefano Favaro Author-X-Name-First: Stefano Author-X-Name-Last: Favaro Author-Name: Susan Holmes Author-X-Name-First: Susan Author-X-Name-Last: Holmes Author-Name: Lorenzo Trippa Author-X-Name-First: Lorenzo Author-X-Name-Last: Trippa Title: Bayesian Nonparametric Ordination for the Analysis of Microbial Communities Abstract: Human microbiome studies use sequencing technologies to measure the abundance of bacterial species or Operational Taxonomic Units (OTUs) in samples of biological material. Typically the data are organized in contingency tables with OTU counts across heterogeneous biological samples. In the microbial ecology community, ordination methods are frequently used to investigate latent factors or clusters that capture and describe variations of OTU counts across biological samples. It remains important to evaluate how uncertainty in estimates of each biological sample’s microbial distribution propagates to ordination analyses, including visualization of clusters and projections of biological samples on low-dimensional spaces. We propose a Bayesian analysis for dependent distributions to endow frequently used ordinations with estimates of uncertainty. A Bayesian nonparametric prior for dependent normalized random measures is constructed, which is marginally equivalent to the normalized generalized Gamma process, a well-known prior for nonparametric analyses. In our prior, the dependence and similarity between microbial distributions is represented by latent factors that concentrate in a low-dimensional space. We use a shrinkage prior to tune the dimensionality of the latent factors. The resulting posterior samples of model parameters can be used to evaluate uncertainty in analyses routinely applied in microbiome studies. Specifically, by combining them with multivariate data analysis techniques we can visualize credible regions in ecological ordination plots. The characteristics of the proposed model are illustrated through a simulation study and applications in two microbiome datasets. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1430-1442 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1288631 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1288631 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1430-1442 Template-Type: ReDIF-Article 1.0 Author-Name: Caleb H. Miles Author-X-Name-First: Caleb H. Author-X-Name-Last: Miles Author-Name: Ilya Shpitser Author-X-Name-First: Ilya Author-X-Name-Last: Shpitser Author-Name: Phyllis Kanki Author-X-Name-First: Phyllis Author-X-Name-Last: Kanki Author-Name: Seema Meloni Author-X-Name-First: Seema Author-X-Name-Last: Meloni Author-Name: Eric J. Tchetgen Tchetgen Author-X-Name-First: Eric J. 
Author-X-Name-Last: Tchetgen Tchetgen Title: Quantifying an Adherence Path-Specific Effect of Antiretroviral Therapy in the Nigeria PEPFAR Program Abstract: Since the early 2000s, evidence has accumulated for a significant differential effect of first-line antiretroviral therapy (ART) regimens on human immunodeficiency virus (HIV) viral load suppression. This finding was replicated in our data from the Harvard President’s Emergency Plan for AIDS Relief (PEPFAR) program in Nigeria. Investigators were interested in finding the source of these differences, that is, understanding the mechanisms through which one regimen outperforms another, particularly via adherence. This question can be naturally formulated via mediation analysis, with adherence playing the role of a mediator. Existing mediation analysis results, however, have relied on an assumption of no exposure-induced confounding of the intermediate variable, and generally require an assumption of no unmeasured confounding for nonparametric identification. Both assumptions are violated by the presence of drug toxicity. In this article, we relax these assumptions and show that certain path-specific effects remain identified under weaker conditions. We focus on the path-specific effect solely mediated by adherence and not by toxicity and propose an estimator for this effect. We illustrate with simulations and present results from a study applying the methodology to the Harvard PEPFAR data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1443-1452 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1295862 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1295862 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1443-1452 Template-Type: ReDIF-Article 1.0 Author-Name: K. Sham Bhat Author-X-Name-First: K. Sham Author-X-Name-Last: Bhat Author-Name: David S. Mebane Author-X-Name-First: David S. Author-X-Name-Last: Mebane Author-Name: Priyadarshi Mahapatra Author-X-Name-First: Priyadarshi Author-X-Name-Last: Mahapatra Author-Name: Curtis B. Storlie Author-X-Name-First: Curtis B. Author-X-Name-Last: Storlie Title: Upscaling Uncertainty with Dynamic Discrepancy for a Multi-Scale Carbon Capture System Abstract: Uncertainties from model parameters and model discrepancy from small-scale models impact the accuracy and reliability of predictions of large-scale systems. Inadequate representation of these uncertainties may result in inaccurate and overconfident predictions during scale-up to larger systems. Hence, multiscale modeling efforts must accurately quantify the effect of the propagation of uncertainties during upscaling. Using a Bayesian approach, we calibrate a small-scale solid sorbent model to thermogravimetric (TGA) data on a functional profile using chemistry-based priors. Crucial to this effort is the representation of model discrepancy, which uses a Bayesian smoothing splines (BSS-ANOVA) framework. Our uncertainty quantification (UQ) approach could be considered intrusive, as it includes the discrepancy function within the chemical rate expressions, resulting in a set of stochastic differential equations. Such an approach allows uncertainty to be propagated easily by carrying the joint posterior of the model parameters and discrepancy into the larger-scale system of rate expressions.
The broad UQ framework presented here could be applicable to virtually all areas of science where multiscale modeling is used. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1453-1467 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1295863 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1295863 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1453-1467 Template-Type: ReDIF-Article 1.0 Author-Name: Ran Tao Author-X-Name-First: Ran Author-X-Name-Last: Tao Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: Dan-Yu Lin Author-X-Name-First: Dan-Yu Author-X-Name-Last: Lin Title: Efficient Semiparametric Inference Under Two-Phase Sampling, With Applications to Genetic Association Studies Abstract: In modern epidemiological and clinical studies, the covariates of interest may involve genome sequencing, biomarker assay, or medical imaging and thus are prohibitively expensive to measure on a large number of subjects. A cost-effective solution is the two-phase design, under which the outcome and inexpensive covariates are observed for all subjects during the first phase and that information is used to select subjects for measurements of expensive covariates during the second phase. For example, subjects with extreme values of quantitative traits were selected for whole-exome sequencing in the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project (ESP). Herein, we consider general two-phase designs, where the outcome can be continuous or discrete, and inexpensive covariates can be continuous and correlated with expensive covariates. We propose a semiparametric approach to regression analysis by approximating the conditional density functions of expensive covariates given inexpensive covariates with B-spline sieves. We devise a computationally efficient and numerically stable EM algorithm to maximize the sieve likelihood. In addition, we establish the consistency, asymptotic normality, and asymptotic efficiency of the estimators. Furthermore, we demonstrate the superiority of the proposed methods over existing ones through extensive simulation studies. Finally, we present applications to the aforementioned NHLBI ESP. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1468-1476 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1295864 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1295864 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1468-1476 Template-Type: ReDIF-Article 1.0 Author-Name: Xinran Li Author-X-Name-First: Xinran Author-X-Name-Last: Li Author-Name: Peng Ding Author-X-Name-First: Peng Author-X-Name-Last: Ding Title: General Forms of Finite Population Central Limit Theorems with Applications to Causal Inference Abstract: Frequentists’ inference often delivers point estimators associated with confidence intervals or sets for parameters of interest. Constructing the confidence intervals or sets requires understanding the sampling distributions of the point estimators, which, in many but not all cases, are related to asymptotic Normal distributions ensured by central limit theorems.
Although previous literature has established various forms of central limit theorems for statistical inference in super population models, we still need general and convenient forms of central limit theorems for some randomization-based causal analyses of experimental data, where the parameters of interest are functions of a finite population and randomness comes solely from the treatment assignment. We use central limit theorems for sample surveys and rank statistics to establish general forms of the finite population central limit theorems that are particularly useful for proving asymptotic distributions of randomization tests under the sharp null hypothesis of zero individual causal effects, and for obtaining the asymptotic repeated sampling distributions of the causal effect estimators. The new central limit theorems hold for general experimental designs with multiple treatment levels, multiple treatment factors, and vector outcomes, and are immediately applicable for studying the asymptotic properties of many methods in causal inference, including instrumental variables, regression adjustment, rerandomization, cluster-randomized experiments, and so on. Previously, the asymptotic properties of these problems were often based on heuristic arguments, which in fact rely on general forms of finite population central limit theorems that had not been established before. Our new theorems fill this gap by providing a more solid theoretical foundation for asymptotic randomization-based causal inference. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1759-1769 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1295865 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1295865 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1759-1769 Template-Type: ReDIF-Article 1.0 Author-Name: D. L. Oberski Author-X-Name-First: D. L. Author-X-Name-Last: Oberski Author-Name: A. Kirchner Author-X-Name-First: A. Author-X-Name-Last: Kirchner Author-Name: S. Eckman Author-X-Name-First: S. Author-X-Name-Last: Eckman Author-Name: F. Kreuter Author-X-Name-First: F. Author-X-Name-Last: Kreuter Title: Evaluating the Quality of Survey and Administrative Data with Generalized Multitrait-Multimethod Models Abstract: Administrative data are increasingly important in statistics, but, like other types of data, may contain measurement errors. To prevent such errors from invalidating analyses of scientific interest, it is therefore essential to estimate the extent of measurement errors in administrative data. Currently, however, most approaches to evaluate such errors involve either prohibitively expensive audits or comparison with a survey that is assumed perfect. We introduce the “generalized multitrait-multimethod” (GMTMM) model, which can be seen as a general framework for evaluating the quality of administrative and survey data simultaneously. This framework allows both survey and administrative data to contain random and systematic measurement errors. Moreover, it accommodates common features of administrative data such as discreteness, nonlinearity, and nonnormality, improving on similar existing models.
The use of the GMTMM model is demonstrated by application to linked survey-administrative data from the German Federal Employment Agency on income from employment, and a simulation study evaluates the estimates obtained and their robustness to model misspecification. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1477-1489 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1302338 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1302338 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1477-1489 Template-Type: ReDIF-Article 1.0 Author-Name: Siem Jan Koopman Author-X-Name-First: Siem Jan Author-X-Name-Last: Koopman Author-Name: Rutger Lit Author-X-Name-First: Rutger Author-X-Name-Last: Lit Author-Name: André Lucas Author-X-Name-First: André Author-X-Name-Last: Lucas Title: Intraday Stochastic Volatility in Discrete Price Changes: The Dynamic Skellam Model Abstract: We study intraday stochastic volatility for four liquid stocks traded on the New York Stock Exchange using a new dynamic Skellam model for high-frequency tick-by-tick discrete price changes. Since the likelihood function is analytically intractable, we rely on numerical methods for its evaluation. Given the high number of observations per series per day (1000 to 10,000), we adopt computationally efficient methods including Monte Carlo integration. The intraday dynamics of volatility and the high number of trades without price impact require nontrivial adjustments to the basic dynamic Skellam model. In-sample residual diagnostics and goodness-of-fit statistics show that the final model provides a good fit to the data. An extensive day-to-day forecasting study of intraday volatility shows that the dynamic modified Skellam model provides accurate forecasts compared to alternative modeling approaches. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1490-1503 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1302878 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1302878 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1490-1503 Template-Type: ReDIF-Article 1.0 Author-Name: Michel H. Hof Author-X-Name-First: Michel H. Author-X-Name-Last: Hof Author-Name: Anita C. Ravelli Author-X-Name-First: Anita C. Author-X-Name-Last: Ravelli Author-Name: Aeilko H. Zwinderman Author-X-Name-First: Aeilko H. Author-X-Name-Last: Zwinderman Title: A Probabilistic Record Linkage Model for Survival Data Abstract: In the absence of a unique identifier, combining information from multiple files relies on partially identifying variables (e.g., gender, initials). With a record linkage procedure, these variables are used to distinguish record pairs that belong together (matches) from record pairs that do not belong together (nonmatches). Generally, the combined strength of the partially identifying variables is too low, causing imperfect linkage: some true nonmatches are identified as matches and, conversely, some true matches as nonmatches. To avoid bias in further analyses, it is necessary to correct for imperfect linkage.
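The standard machinery behind such match/nonmatch classification is the Fellegi–Sunter log-likelihood-ratio weight; the sketch below (illustrative field names and probabilities, not the authors' joint survival model) scores one record pair from agreement indicators on partially identifying variables:

    # Fellegi-Sunter-style match weight: sum log(m/u) over agreeing fields and
    # log((1-m)/(1-u)) over disagreeing fields; large weights suggest a match.
    import math

    m_prob = {"gender": 0.99, "initials": 0.95, "birth_year": 0.98}  # P(agree | match)
    u_prob = {"gender": 0.50, "initials": 0.10, "birth_year": 0.03}  # P(agree | nonmatch)

    def match_weight(agreement):
        """agreement maps field name -> True/False agreement for one record pair."""
        weight = 0.0
        for field, agree in agreement.items():
            m, u = m_prob[field], u_prob[field]
            weight += math.log(m / u) if agree else math.log((1 - m) / (1 - u))
        return weight

    print(match_weight({"gender": True, "initials": True, "birth_year": False}))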
In this article, pregnancy data from the Perinatal Registry of the Netherlands were used to estimate the associations between the (baseline) characteristics from the first delivery and the time to a second delivery. Because of privacy regulations, no unique identifier was available to determine which pregnancies belonged to the same woman. To deal with imperfect linkage in a time-to-event setting, where we have a file with baseline characteristics and a file with event times, we developed a joint model in which the record linkage procedure and the time-to-event analysis are performed simultaneously. R code and example data are available as online supplemental material. Journal: Journal of the American Statistical Association Pages: 1504-1515 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1311262 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1311262 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1504-1515 Template-Type: ReDIF-Article 1.0 Author-Name: Scott W. Linderman Author-X-Name-First: Scott W. Author-X-Name-Last: Linderman Author-Name: David M. Blei Author-X-Name-First: David M. Author-X-Name-Last: Blei Title: Comment: A Discussion of “Nonparametric Bayes Modeling of Populations of Networks” Journal: Journal of the American Statistical Association Pages: 1543-1547 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1388244 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1388244 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1543-1547 Template-Type: ReDIF-Article 1.0 Author-Name: Nicholas J. Foti Author-X-Name-First: Nicholas J. Author-X-Name-Last: Foti Author-Name: Emily B. Fox Author-X-Name-First: Emily B. Author-X-Name-Last: Fox Title: Comment: Nonparametric Bayes Modeling of Populations of Networks Journal: Journal of the American Statistical Association Pages: 1539-1543 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1388245 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1388245 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1539-1543 Template-Type: ReDIF-Article 1.0 Author-Name: Adrian E. Raftery Author-X-Name-First: Adrian E. Author-X-Name-Last: Raftery Title: Comment: Extending the Latent Position Model for Networks Journal: Journal of the American Statistical Association Pages: 1531-1534 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1389736 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1389736 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1531-1534 Template-Type: ReDIF-Article 1.0 Author-Name: Mark S. Handcock Author-X-Name-First: Mark S. Author-X-Name-Last: Handcock Title: Comment Journal: Journal of the American Statistical Association Pages: 1537-1539 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1389737 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1389737 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1537-1539 Template-Type: ReDIF-Article 1.0 Author-Name: Tamara Broderick Author-X-Name-First: Tamara Author-X-Name-Last: Broderick Title: Comment: Nonparametric Bayes Modeling of Populations of Networks Journal: Journal of the American Statistical Association Pages: 1534-1537 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1389738 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1389738 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1534-1537 Template-Type: ReDIF-Article 1.0 Author-Name: Samuel D. Pimentel Author-X-Name-First: Samuel D. Author-X-Name-Last: Pimentel Author-Name: Rachel R. Kelz Author-X-Name-First: Rachel R. Author-X-Name-Last: Kelz Author-Name: Jeffrey H. Silber Author-X-Name-First: Jeffrey H. Author-X-Name-Last: Silber Author-Name: Paul R. Rosenbaum Author-X-Name-First: Paul R. Author-X-Name-Last: Rosenbaum Title: Correction Journal: Journal of the American Statistical Association Pages: 1770-1770 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1395640 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1395640 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1770-1770 Template-Type: ReDIF-Article 1.0 Author-Name: Daniele Durante Author-X-Name-First: Daniele Author-X-Name-Last: Durante Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Author-Name: Joshua T. Vogelstein Author-X-Name-First: Joshua T. Author-X-Name-Last: Vogelstein Title: Rejoinder: Nonparametric Bayes Modeling of Populations of Networks Journal: Journal of the American Statistical Association Pages: 1547-1552 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1395643 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1395643 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1547-1552 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Editorial Collaborators Journal: Journal of the American Statistical Association Pages: 1784-1791 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1395645 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1395645 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1784-1791 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Editorial Board EOV Journal: Journal of the American Statistical Association Pages: ebi-ebi Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1400347 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1400347 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:ebi-ebi Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Book Reviews Journal: Journal of the American Statistical Association Pages: 1771-1783 Issue: 520 Volume: 112 Year: 2017 Month: 10 X-DOI: 10.1080/01621459.2017.1411709 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1411709 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1771-1783 Template-Type: ReDIF-Article 1.0 Author-Name: Xin Zhou Author-X-Name-First: Xin Author-X-Name-Last: Zhou Author-Name: Nicole Mayer-Hamblett Author-X-Name-First: Nicole Author-X-Name-Last: Mayer-Hamblett Author-Name: Umer Khan Author-X-Name-First: Umer Author-X-Name-Last: Khan Author-Name: Michael R. Kosorok Author-X-Name-First: Michael R. Author-X-Name-Last: Kosorok Title: Residual Weighted Learning for Estimating Individualized Treatment Rules Abstract: Personalized medicine has received increasing attention among statisticians, computer scientists, and clinical practitioners. A major component of personalized medicine is the estimation of individualized treatment rules (ITRs). Recently, Zhao et al. proposed outcome weighted learning (OWL) to construct ITRs that directly optimize the clinical outcome. Although OWL opens the door to introducing machine learning techniques to optimal treatment regimes, it still has several performance problems: (1) the estimated ITR of OWL is affected by a simple shift of the outcome; (2) the rule from OWL tries to keep the treatment assignments that subjects actually received; (3) there is no variable selection mechanism with OWL. All of these weaken the finite sample performance of OWL. In this article, we propose a general framework, called residual weighted learning (RWL), to alleviate these problems, and hence to improve finite sample performance. Unlike OWL, which weights misclassification errors by clinical outcomes, RWL weights these errors by residuals of the outcome from a regression fit on clinical covariates excluding treatment assignment. We use the smoothed ramp loss function in RWL and provide a difference of convex (d.c.) algorithm to solve the corresponding nonconvex optimization problem. By estimating residuals with linear models or generalized linear models, RWL can effectively deal with different types of outcomes, such as continuous, binary, and count outcomes. We also propose variable selection methods for linear and nonlinear rules, respectively, to further improve the performance. We show that the resulting estimator of the treatment rule is consistent. We further obtain a rate of convergence for the difference between the expected outcome using the estimated ITR and that of the optimal treatment rule. The performance of the proposed RWL methods is illustrated in simulation studies and in an analysis of cystic fibrosis clinical trial data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 169-187 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2015.1093947 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1093947 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:169-187 Template-Type: ReDIF-Article 1.0 Author-Name: Qing Yang Author-X-Name-First: Qing Author-X-Name-Last: Yang Author-Name: Guangming Pan Author-X-Name-First: Guangming Author-X-Name-Last: Pan Title: Weighted Statistic in Detecting Faint and Sparse Alternatives for High-Dimensional Covariance Matrices Abstract: This article considers testing equality of two population covariance matrices when the data dimension p diverges with the sample size n (p/n → c > 0). We propose a weighted test statistic that is data-driven and powerful in both faint alternatives (many small disturbances) and sparse alternatives (several large disturbances).
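The underlying idea of such a weighted statistic can be illustrated schematically (a simplified form for intuition, not the article's exact construction): given a standardized sum-type statistic $T_{sum}$, powerful against faint dense alternatives, and a standardized max-type statistic $T_{max}$, powerful against sparse alternatives, one combines them as $T_w = w T_{sum} + (1 - w) T_{max}$ with a data-driven weight $w \in [0, 1]$, so that neither regime of alternatives dominates the power of the test.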
Its asymptotic null distribution is derived by large random matrix theory without assuming the existence of a limiting cumulative distribution function of the population covariance matrix. The simulation results confirm that our statistic is powerful against all alternatives, while other tests given in the literature fail in at least one situation. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 188-200 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2015.1122602 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1122602 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:188-200 Template-Type: ReDIF-Article 1.0 Author-Name: Matthias Katzfuss Author-X-Name-First: Matthias Author-X-Name-Last: Katzfuss Title: A Multi-Resolution Approximation for Massive Spatial Datasets Abstract: Automated sensing instruments on satellites and aircraft have enabled the collection of massive amounts of high-resolution observations of spatial fields over large spatial regions. If these datasets can be efficiently exploited, they can provide new insights on a wide variety of issues. However, traditional spatial-statistical techniques such as kriging are not computationally feasible for big datasets. We propose a multi-resolution approximation (M-RA) of Gaussian processes observed at irregular locations in space. The M-RA process is specified as a linear combination of basis functions at multiple levels of spatial resolution, which can capture spatial structure from very fine to very large scales. The basis functions are automatically chosen to approximate a given covariance function, which can be nonstationary. All computations involving the M-RA, including parameter inference and prediction, are highly scalable for massive datasets. Crucially, the inference algorithms can also be parallelized to take full advantage of large distributed-memory computing environments. In comparisons using simulated data and a large satellite dataset, the M-RA outperforms a related state-of-the-art method. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 201-214 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2015.1123632 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1123632 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:201-214 Template-Type: ReDIF-Article 1.0 Author-Name: Thaís C. O. Fonseca Author-X-Name-First: Thaís C. O. Author-X-Name-Last: Fonseca Author-Name: Marco A. R. Ferreira Author-X-Name-First: Marco A. R. Author-X-Name-Last: Ferreira Title: Dynamic Multiscale Spatiotemporal Models for Poisson Data Abstract: We propose a new class of dynamic multiscale models for Poisson spatiotemporal processes. Specifically, we use a multiscale spatial Poisson factorization to decompose the Poisson process at each time point into spatiotemporal multiscale coefficients. We then connect these spatiotemporal multiscale coefficients through time with a novel Dirichlet evolution. Further, we propose a simulation-based full Bayesian posterior analysis. 
In particular, we develop filtering equations for updating information forward in time and smoothing equations for integrating information backward in time, and use these equations to develop a forward filter backward sampler for the spatiotemporal multiscale coefficients. Because the multiscale coefficients are conditionally independent a posteriori, our full Bayesian posterior analysis is scalable, computationally efficient, and highly parallelizable. Moreover, the Dirichlet evolution of each spatiotemporal multiscale coefficient is parametrized by a discount factor that encodes the relevance of the temporal evolution of the spatiotemporal multiscale coefficient. Therefore, the analysis of discount factors provides a powerful way to identify regions with distinctive spatiotemporal dynamics. Finally, we illustrate the usefulness of our multiscale spatiotemporal Poisson methodology with two applications. The first application examines mortality ratios in the state of Missouri, and the second application considers tornado reports in the American Midwest. Journal: Journal of the American Statistical Association Pages: 215-234 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2015.1129968 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1129968 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:215-234 Template-Type: ReDIF-Article 1.0 Author-Name: Shaojun Guo Author-X-Name-First: Shaojun Author-X-Name-Last: Guo Author-Name: John Leigh Box Author-X-Name-First: John Leigh Author-X-Name-Last: Box Author-Name: Wenyang Zhang Author-X-Name-First: Wenyang Author-X-Name-Last: Zhang Title: A Dynamic Structure for High-Dimensional Covariance Matrices and Its Application in Portfolio Allocation Abstract: Estimation of high-dimensional covariance matrices is an interesting and important research topic. In this article, we propose a dynamic structure and develop an estimation procedure for high-dimensional covariance matrices. Asymptotic properties are derived to justify the estimation procedure and simulation studies are conducted to demonstrate its performance when the sample size is finite. In a financial application, an empirical study shows that portfolio allocation based on dynamic high-dimensional covariance matrices can significantly outperform the market from 1995 to 2014. Our proposed method also outperforms portfolio allocation based on the sample covariance matrix, the covariance matrix based on factor models, and the shrinkage estimator of the covariance matrix. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 235-253 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2015.1129969 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1129969 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:235-253 Template-Type: ReDIF-Article 1.0 Author-Name: David Rossell Author-X-Name-First: David Author-X-Name-Last: Rossell Author-Name: Donatello Telesca Author-X-Name-First: Donatello Author-X-Name-Last: Telesca Title: Nonlocal Priors for High-Dimensional Estimation Abstract: Jointly achieving parsimony and good predictive power in high dimensions is a main challenge in statistics. Nonlocal priors (NLPs) possess appealing properties for model choice, but their use for estimation has not been studied in detail.
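A concrete example of a nonlocal prior (a standard instance quoted for illustration, not the only class the article covers) is the product moment (pMOM) prior, whose density $\pi(\theta_j) = (\theta_j^2 / (\tau \sigma^2)) N(\theta_j; 0, \tau \sigma^2)$ vanishes at the null value $\theta_j = 0$; this vanishing at the origin is what separates NLPs from conventional local priors and drives the fast shrinkage of spurious parameters described below.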
We show that for regular models NLP-based Bayesian model averaging (BMA) shrinks spurious parameters at either fast polynomial or quasi-exponential rates as the sample size n increases, while nonspurious parameter estimates are not shrunk. We extend some results to linear models with dimension p growing with n. Complementing our theoretical investigations, we outline the constructive representation of NLPs as mixtures of truncated distributions, which enables simple posterior sampling and extends NLPs beyond previous proposals. Our results show notable high-dimensional estimation performance for linear models with p >> n at low computational cost. NLPs provided lower estimation error than benchmark and hyper-g priors, SCAD and LASSO in simulations, and in gene expression data achieved higher cross-validated R2 with fewer predictors. Remarkably, these results were obtained without prescreening variables. Our findings contribute to the debate over whether different priors should be used for estimation and model selection, showing that selection priors may actually be desirable for high-dimensional estimation. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 254-265 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2015.1130634 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1130634 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:254-265 Template-Type: ReDIF-Article 1.0 Author-Name: Tao Zou Author-X-Name-First: Tao Author-X-Name-Last: Zou Author-Name: Wei Lan Author-X-Name-First: Wei Author-X-Name-Last: Lan Author-Name: Hansheng Wang Author-X-Name-First: Hansheng Author-X-Name-Last: Wang Author-Name: Chih-Ling Tsai Author-X-Name-First: Chih-Ling Author-X-Name-Last: Tsai Title: Covariance Regression Analysis Abstract: This article introduces covariance regression analysis for a p-dimensional response vector. The proposed method explores the regression relationship between the p-dimensional covariance matrix and auxiliary information. We study three types of estimators: maximum likelihood, ordinary least squares, and feasible generalized least squares estimators. Then, we demonstrate that these regression estimators are consistent and asymptotically normal. Furthermore, we obtain the high dimensional and large sample properties of the corresponding covariance matrix estimators. Simulation experiments are presented to demonstrate the performance of both regression and covariance matrix estimates. An example from the Chinese stock market is analyzed to illustrate the usefulness of the proposed covariance regression model. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 266-281 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2015.1131699 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1131699 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:266-281 Template-Type: ReDIF-Article 1.0 Author-Name: Jack Kuipers Author-X-Name-First: Jack Author-X-Name-Last: Kuipers Author-Name: Giusi Moffa Author-X-Name-First: Giusi Author-X-Name-Last: Moffa Title: Partition MCMC for Inference on Acyclic Digraphs Abstract: Acyclic digraphs are the underlying representation of Bayesian networks, a widely used class of probabilistic graphical models.
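As background for the structure-learning discussion that continues below, a minimal single-edge structure MCMC over DAGs can be sketched as follows (a toy version of the standard baseline the article improves upon, with a stand-in score rather than a real BDe/BGe score):

    # Toy structure MCMC: propose toggling one directed edge, reject proposals
    # that create a cycle, and accept otherwise by a Metropolis-Hastings ratio.
    import numpy as np

    def is_dag(adj):
        remaining = list(range(adj.shape[0]))
        while remaining:
            roots = [i for i in remaining if not any(adj[j, i] for j in remaining)]
            if not roots:
                return False  # every remaining node has a parent, so there is a cycle
            remaining = [i for i in remaining if i not in roots]
        return True

    def structure_mcmc(log_score, n_nodes, n_iter, seed=0):
        rng = np.random.default_rng(seed)
        adj = np.zeros((n_nodes, n_nodes), dtype=int)  # start from the empty DAG
        samples = []
        for _ in range(n_iter):
            i, j = rng.choice(n_nodes, size=2, replace=False)
            prop = adj.copy()
            prop[i, j] ^= 1  # symmetric single-edge toggle proposal
            if is_dag(prop) and np.log(rng.uniform()) < log_score(prop) - log_score(adj):
                adj = prop
            samples.append(adj.copy())
        return samples

    # stand-in modular score favoring sparse graphs
    samples = structure_mcmc(lambda g: -2.0 * g.sum(), n_nodes=4, n_iter=2000)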
Learning the underlying graph from data is a way of gaining insights about the structural properties of a domain. Structure learning forms one of the inference challenges of statistical graphical models. Markov chain Monte Carlo (MCMC) methods that sample graphs from the posterior distribution given the data, notably structure MCMC, are probably the only viable option for Bayesian model averaging. Score modularity and restrictions on the number of parents of each node allow the graphs to be grouped into larger collections, which can be scored as a whole to improve the chain’s convergence. Current examples of algorithms taking advantage of grouping are the biased order MCMC, which acts on the alternative space of permuted triangular matrices, and nonergodic edge reversal moves. Here, we propose a novel algorithm, which employs the underlying combinatorial structure of DAGs to define a new grouping. As a result, convergence is improved compared to structure MCMC, while still retaining the property of producing an unbiased sample. Finally, the method can be combined with edge reversal moves to improve the sampler further. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 282-299 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2015.1133426 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1133426 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:282-299 Template-Type: ReDIF-Article 1.0 Author-Name: Jonghyun Yun Author-X-Name-First: Jonghyun Author-X-Name-Last: Yun Author-Name: Fan Yang Author-X-Name-First: Fan Author-X-Name-Last: Yang Author-Name: Yuguo Chen Author-X-Name-First: Yuguo Author-X-Name-Last: Chen Title: Augmented Particle Filters Abstract: Particle filters have been widely used for online filtering problems in state–space models (SSMs). The currently available proposal distributions depend only on the state dynamics, only on the observation, or on both sources of information but are not available for general SSMs. In this article, we develop a new particle filtering algorithm, called the augmented particle filter (APF), for online filtering problems in SSMs. The APF combines two sets of particles from the observation equation and the state equation, and the state space is augmented to facilitate the weight computation. Theoretical justification of the APF is provided, and the connection between the APF and the optimal particle filter (OPF) in some special SSMs is investigated. The APF shares similar properties with the OPF, but the APF can be applied to a much wider range of models than the OPF. Simulation studies show that the APF performs similarly to or better than the OPF when the OPF is available, and the APF can perform better than other filtering algorithms in the literature when the OPF is not available. Journal: Journal of the American Statistical Association Pages: 300-313 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2015.1135803 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1135803 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:300-313 Template-Type: ReDIF-Article 1.0 Author-Name: Roderick J. Little Author-X-Name-First: Roderick J. Author-X-Name-Last: Little Author-Name: Donald B. Rubin Author-X-Name-First: Donald B. Author-X-Name-Last: Rubin Author-Name: Sahar Z.
Zangeneh Author-X-Name-First: Sahar Z. Author-X-Name-Last: Zangeneh Title: Conditions for Ignoring the Missing-Data Mechanism in Likelihood Inferences for Parameter Subsets Abstract: For likelihood-based inferences from data with missing values, models are generally needed for both the data and the missing-data mechanism. However, modeling the mechanism can be challenging, and parameters are often poorly identified. Rubin in 1976 showed that for likelihood and Bayesian inference, sufficient conditions for ignoring the missing data mechanism are (a) the missing data are missing at random (MAR), in the sense that missingness does not depend on the missing values after conditioning on the observed data and (b) the parameters of the data model and the missingness mechanism are distinct, that is, there are no a priori ties, via parameter space restrictions or prior distributions, between these two sets of parameters. These conditions are sufficient but not always necessary, and they relate to the full vector of parameters of the data model. We propose definitions of partially MAR and ignorability for a subvector of the parameters of particular substantive interest, for direct likelihood/Bayesian and frequentist likelihood-based inference. We apply these definitions to a variety of examples. We also discuss conditioning on the pattern of missingness, as an alternative strategy for avoiding the need to model the missingness mechanism. Journal: Journal of the American Statistical Association Pages: 314-320 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2015.1136826 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1136826 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:314-320 Template-Type: ReDIF-Article 1.0 Author-Name: Colin B. Fogarty Author-X-Name-First: Colin B. Author-X-Name-Last: Fogarty Author-Name: Pixu Shi Author-X-Name-First: Pixu Author-X-Name-Last: Shi Author-Name: Mark E. Mikkelsen Author-X-Name-First: Mark E. Author-X-Name-Last: Mikkelsen Author-Name: Dylan S. Small Author-X-Name-First: Dylan S. Author-X-Name-Last: Small Title: Randomization Inference and Sensitivity Analysis for Composite Null Hypotheses With Binary Outcomes in Matched Observational Studies Abstract: We present methods for conducting hypothesis testing and sensitivity analyses for composite null hypotheses in matched observational studies when outcomes are binary. Causal estimands discussed include the causal risk difference, causal risk ratio, and the effect ratio. We show that inference under the assumption of no unmeasured confounding can be performed by solving an integer linear program, while inference allowing for unmeasured confounding of a given strength requires solving an integer quadratic program. Through simulation studies and data examples, we demonstrate that our formulation allows these problems to be solved in an expedient manner even for large datasets and for large strata. We further exhibit that through our formulation, one can assess the impact of various assumptions about the potential outcomes on the performed inference. R scripts are provided that implement our methods. Supplementary materials for this article are available online. 
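For the simplest member of this family of problems, Fisher's sharp null in matched pairs with binary outcomes, a randomization test can be coded directly (a toy sketch with made-up data; the article's integer-programming formulation is what handles composite nulls and sensitivity analysis at scale):

    # Randomization test of the sharp null of no effect in matched pairs:
    # under the null, outcomes are fixed and labels may be swapped within pairs.
    import numpy as np

    rng = np.random.default_rng(1)
    treated = np.array([1, 1, 0, 1, 1, 0, 1, 1])  # binary outcomes of treated units
    control = np.array([0, 1, 0, 0, 1, 0, 0, 1])  # outcomes of matched controls
    observed = np.sum(treated - control)          # test statistic

    draws = []
    for _ in range(10000):
        swap = rng.integers(0, 2, size=treated.size).astype(bool)
        t = np.where(swap, control, treated)
        c = np.where(swap, treated, control)
        draws.append(np.sum(t - c))
    p_value = np.mean(np.abs(np.array(draws)) >= abs(observed))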
Journal: Journal of the American Statistical Association Pages: 321-331 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1138865 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1138865 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:321-331 Template-Type: ReDIF-Article 1.0 Author-Name: Jia Li Author-X-Name-First: Jia Author-X-Name-Last: Li Author-Name: Viktor Todorov Author-X-Name-First: Viktor Author-X-Name-Last: Todorov Author-Name: George Tauchen Author-X-Name-First: George Author-X-Name-Last: Tauchen Title: Robust Jump Regressions Abstract: We develop robust inference methods for studying linear dependence between the jumps of discretely observed processes at high frequency. Unlike classical linear regressions, jump regressions are determined by a small number of jumps occurring over a fixed time interval and the rest of the components of the processes around the jump times. The latter are the continuous martingale parts of the processes as well as observation noise. By sampling more frequently, the role of these components, which are hidden in the observed price, shrinks asymptotically. The robustness of our inference procedure is with respect to outliers, which are of particular importance in the current setting of a relatively small number of jump observations. This is achieved by using nonsmooth loss functions (like L1) in the estimation. Unlike classical robust methods, the limit of the objective function here remains nonsmooth. The proposed method is also robust to measurement error in the observed processes, which is achieved by locally smoothing the high-frequency increments. In an empirical application to financial data, we illustrate the usefulness of the robust techniques by contrasting the behavior of robust and ordinary least squares (OLS)-type jump regressions in periods including disruptions of the financial markets such as so-called “flash crashes.” Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 332-341 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1138866 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1138866 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:332-341 Template-Type: ReDIF-Article 1.0 Author-Name: Yuan Huang Author-X-Name-First: Yuan Author-X-Name-Last: Huang Author-Name: Qingzhao Zhang Author-X-Name-First: Qingzhao Author-X-Name-Last: Zhang Author-Name: Sanguo Zhang Author-X-Name-First: Sanguo Author-X-Name-Last: Zhang Author-Name: Jian Huang Author-X-Name-First: Jian Author-X-Name-Last: Huang Author-Name: Shuangge Ma Author-X-Name-First: Shuangge Author-X-Name-Last: Ma Title: Promoting Similarity of Sparsity Structures in Integrative Analysis With Penalization Abstract: For data with high-dimensional covariates but small sample sizes, the analysis of single datasets often generates unsatisfactory results. The integrative analysis of multiple independent datasets provides an effective way of pooling information and outperforms single-dataset and several alternative multi-dataset methods. Under many scenarios, multiple datasets are expected to share common important covariates, that is, the corresponding models have similarity in their sparsity structures.
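Schematically, similarity of sparsity structures across $M$ datasets means that the support indicators $I(\beta_j^{(m)} \neq 0)$ agree across $m = 1, \ldots, M$ for each covariate $j$; an L0-type penalty encoding this (written here in a simplified form that is not necessarily the article's exact formulation) is $\lambda \sum_j \sum_{m < m'} |I(\beta_j^{(m)} \neq 0) - I(\beta_j^{(m')} \neq 0)|$, which charges a cost whenever a covariate is selected in one dataset but not in another.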
However, the existing methods do not have a mechanism to promote the similarity in sparsity structures in integrative analysis. In this study, we consider penalized variable selection and estimation in integrative analysis. We develop an L0-penalty-based method, which explicitly promotes the similarity in sparsity structures. Computationally, it is realized using a coordinate descent algorithm. Theoretically, it enjoys selection and estimation consistency. Under a wide spectrum of simulation scenarios, it has identification and estimation performance comparable to or better than the alternatives. In the analysis of three lung cancer datasets with gene expression measurements, it identifies genes with sound biological implications and satisfactory prediction performance. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 342-350 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1139497 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1139497 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:342-350 Template-Type: ReDIF-Article 1.0 Author-Name: Kwun Chuen Gary Chan Author-X-Name-First: Kwun Chuen Gary Author-X-Name-Last: Chan Author-Name: Mei-Cheng Wang Author-X-Name-First: Mei-Cheng Author-X-Name-Last: Wang Title: Semiparametric Modeling and Estimation of the Terminal Behavior of Recurrent Marker Processes Before Failure Events Abstract: Recurrent event processes with marker measurements are mostly studied with forward-time models starting from an initial event. Interestingly, the processes could exhibit important terminal behavior during a time period before the occurrence of the failure event. A natural and direct way to study recurrent events prior to a failure event is to align the processes using the failure event as the time origin and to examine the terminal behavior by a backward time model. This article studies regression models for backward recurrent marker processes by counting time backward from the failure event. A three-level semiparametric regression model is proposed for jointly modeling the time to a failure event, the backward recurrent event process, and the marker observed at the time of each backward recurrent event. The first level is a proportional hazards model for the failure time, the second level is a proportional rate model for the recurrent events occurring before the failure event, and the third level is a proportional mean model for the marker given the occurrence of a recurrent event backward in time. By jointly modeling the three components, estimating equations can be constructed for marked counting processes to estimate the target parameters in the three-level regression models. Large sample properties of the proposed estimators are studied and established. The proposed models and methods are illustrated by a community-based AIDS clinical trial to examine the terminal behavior of frequencies and severities of opportunistic infections among HIV-infected individuals in the last 6 months of life. Journal: Journal of the American Statistical Association Pages: 351-362 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1140051 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1140051 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:351-362 Template-Type: ReDIF-Article 1.0 Author-Name: Simón Lunagómez Author-X-Name-First: Simón Author-X-Name-Last: Lunagómez Author-Name: Sayan Mukherjee Author-X-Name-First: Sayan Author-X-Name-Last: Mukherjee Author-Name: Robert L. Wolpert Author-X-Name-First: Robert L. Author-X-Name-Last: Wolpert Author-Name: Edoardo M. Airoldi Author-X-Name-First: Edoardo M. Author-X-Name-Last: Airoldi Title: Geometric Representations of Random Hypergraphs Abstract: We introduce a novel parameterization of distributions on hypergraphs based on the geometry of points in ${\mathbb{R}}^d$. The idea is to induce distributions on hypergraphs by placing priors on point configurations via spatial processes. This specification is then used to infer conditional independence models, or Markov structure, for multivariate distributions. This approach results in a broader class of conditional independence models beyond standard graphical models. Factorizations that cannot be retrieved via a graph are possible. Inference of nondecomposable graphical models is possible without requiring decomposability or Gaussian assumptions. This approach leads to new Metropolis-Hastings Markov chain Monte Carlo algorithms with both local and global moves in graph space, generally offers greater control over the distribution of graph features than currently possible, and naturally extends to hypergraphs. We provide a comparative performance evaluation against state-of-the-art approaches, and illustrate the utility of this approach on simulated and real data. Journal: Journal of the American Statistical Association Pages: 363-383 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1141686 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1141686 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:363-383 Template-Type: ReDIF-Article 1.0 Author-Name: Ilze Kalnina Author-X-Name-First: Ilze Author-X-Name-Last: Kalnina Author-Name: Dacheng Xiu Author-X-Name-First: Dacheng Author-X-Name-Last: Xiu Title: Nonparametric Estimation of the Leverage Effect: A Trade-Off Between Robustness and Efficiency Abstract: We consider two new approaches to nonparametric estimation of the leverage effect. The first approach uses stock prices alone. The second approach uses the data on stock prices as well as a certain volatility instrument, such as the Chicago Board Options Exchange (CBOE) volatility index (VIX) or the Black–Scholes implied volatility. The theoretical justification for the instrument-based estimator relies on a certain invariance property, which can be exploited when high-frequency data are available. The price-only estimator is more robust since it is valid under weaker assumptions. However, in the presence of a valid volatility instrument, the price-only estimator is inefficient as the instrument-based estimator has a faster rate of convergence. We consider an empirical application, in which we study the relationship between the leverage effect and the debt-to-equity ratio, credit risk, and illiquidity. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 384-396 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1141687 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1141687 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:384-396 Template-Type: ReDIF-Article 1.0 Author-Name: Hao Chen Author-X-Name-First: Hao Author-X-Name-Last: Chen Author-Name: Jerome H. Friedman Author-X-Name-First: Jerome H. Author-X-Name-Last: Friedman Title: A New Graph-Based Two-Sample Test for Multivariate and Object Data Abstract: Two-sample tests for multivariate data and especially for non-Euclidean data are not well explored. This article presents a novel test statistic based on a similarity graph constructed on the pooled observations from the two samples. It can be applied to multivariate data and non-Euclidean data as long as a dissimilarity measure on the sample space can be defined, which can usually be provided by domain experts. Existing tests based on a similarity graph lack power either for location or for scale alternatives. The new test uses a common pattern that was overlooked previously, and works for both types of alternatives. The test exhibits substantial power gains in simulation studies. Its asymptotic permutation null distribution is derived and shown to work well under finite samples, facilitating its application to large datasets. The new test is illustrated on two applications: The assessment of covariate balance in a matched observational study, and the comparison of network data under different conditions. Journal: Journal of the American Statistical Association Pages: 397-409 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1147356 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1147356 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:397-409 Template-Type: ReDIF-Article 1.0 Author-Name: Shujie Ma Author-X-Name-First: Shujie Author-X-Name-Last: Ma Author-Name: Jian Huang Author-X-Name-First: Jian Author-X-Name-Last: Huang Title: A Concave Pairwise Fusion Approach to Subgroup Analysis Abstract: An important step in developing individualized treatment strategies is correct identification of subgroups of a heterogeneous population to allow specific treatment for each subgroup. This article considers the problem using samples drawn from a population consisting of subgroups with different mean values, along with certain covariates. We propose a penalized approach for subgroup analysis based on a regression model, in which heterogeneity is driven by unobserved latent factors and thus can be represented by using subject-specific intercepts. We apply concave penalty functions to pairwise differences of the intercepts. This procedure automatically divides the observations into subgroups. To implement the proposed approach, we develop an alternating direction method of multipliers algorithm with concave penalties and demonstrate its convergence. We also establish the theoretical properties of our proposed estimator and determine the order requirement of the minimal difference of signals between groups to recover them. These results provide a sound basis for making statistical inference in subgroup analysis. Our proposed method is further illustrated by simulation studies and analysis of a Cleveland heart disease dataset. Supplementary materials for this article are available online. 
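The concave pairwise fusion criterion has the generic form (a sketch consistent with the abstract, with $p_{\gamma}$ denoting a concave penalty such as SCAD or MCP) $Q(\mu, \beta) = \frac{1}{2} \sum_{i} (y_i - \mu_i - x_i^{T} \beta)^2 + \sum_{i < j} p_{\gamma}(|\mu_i - \mu_j|, \lambda)$: the penalty fuses nearby subject-specific intercepts to a common value, and subjects whose estimated intercepts coincide form the recovered subgroups.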
Journal: Journal of the American Statistical Association Pages: 410-423 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1148039 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1148039 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:410-423 Template-Type: ReDIF-Article 1.0 Author-Name: Mark Fiecas Author-X-Name-First: Mark Author-X-Name-Last: Fiecas Author-Name: Jürgen Franke Author-X-Name-First: Jürgen Author-X-Name-Last: Franke Author-Name: Rainer von Sachs Author-X-Name-First: Rainer Author-X-Name-Last: von Sachs Author-Name: Joseph Tadjuidje Kamgaing Author-X-Name-First: Joseph Author-X-Name-Last: Tadjuidje Kamgaing Title: Shrinkage Estimation for Multivariate Hidden Markov Models Abstract: Motivated by a changing market environment over time, we consider high-dimensional data such as financial returns, generated by a hidden Markov model that allows for switching between different regimes or states. To get more stable estimates of the covariance matrices of the different states, potentially driven by a number of observations that are small compared to the dimension, we modify the expectation–maximization (EM) algorithm so that it yields the shrinkage estimators for the covariance matrices. The final algorithm turns out to produce better estimates not only of the covariance matrices but also of the transition matrix. It results in a more stable and reliable filter that allows for reconstructing the values of the hidden Markov chain. In addition to a simulation study performed in this article, we also present a series of theoretical results that include dimensionality asymptotics and provide the motivation for certain techniques used in the algorithm. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 424-435 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1148608 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1148608 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:424-435 Template-Type: ReDIF-Article 1.0 Author-Name: Andreas Alfons Author-X-Name-First: Andreas Author-X-Name-Last: Alfons Author-Name: Christophe Croux Author-X-Name-First: Christophe Author-X-Name-Last: Croux Author-Name: Peter Filzmoser Author-X-Name-First: Peter Author-X-Name-Last: Filzmoser Title: Robust Maximum Association Estimators Abstract: The maximum association between two multivariate variables $\boldsymbol{X}$ and $\boldsymbol{Y}$ is defined as the maximal value that a bivariate association measure between one-dimensional projections ${\boldsymbol{\alpha}}^{t} \boldsymbol{X}$ and ${\boldsymbol{\beta}}^{t} \boldsymbol{Y}$ can attain. Taking the Pearson correlation as projection index results in the first canonical correlation coefficient. We propose to use more robust association measures, such as Spearman’s or Kendall’s rank correlation, or association measures derived from bivariate scatter matrices. We study the robustness of the proposed maximum association measures and the corresponding estimators of the coefficients yielding the maximum association. In the important special case of $\boldsymbol{Y}$ being univariate, maximum rank correlation estimators yield regression estimators that are invariant against monotonic transformations of the response.
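A direct, if crude, way to compute such a rank-based maximum association is to numerically maximize Spearman's rank correlation between the two projections (a toy sketch on simulated data; the article develops dedicated estimators and their asymptotic theory):

    # Maximize |Spearman rho| between projections a'X and b'Y over (a, b).
    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import spearmanr

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 3))
    signal = np.tanh(X @ np.array([1.0, -0.5, 0.25]))
    Y = np.column_stack([signal + 0.1 * rng.normal(size=200), rng.normal(size=200)])

    def neg_assoc(theta):
        a, b = theta[:3], theta[3:]
        rho = spearmanr(X @ a, Y @ b)[0]
        return -abs(rho)

    fit = minimize(neg_assoc, x0=np.ones(5), method="Nelder-Mead")
    a_hat = fit.x[:3] / np.linalg.norm(fit.x[:3])  # projection directions are
    b_hat = fit.x[3:] / np.linalg.norm(fit.x[3:])  # identified only up to scale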
We obtain asymptotic variances for this special case. It turns out that maximum rank correlation estimators combine good efficiency and robustness properties. Simulations and a real data example illustrate the robustness of these estimators and their power for handling nonlinear relationships. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 436-445 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1148609 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1148609 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:436-445 Template-Type: ReDIF-Article 1.0 Author-Name: Andreas Hagemann Author-X-Name-First: Andreas Author-X-Name-Last: Hagemann Title: Cluster-Robust Bootstrap Inference in Quantile Regression Models Abstract: In this article I develop a wild bootstrap procedure for cluster-robust inference in linear quantile regression models. I show that the bootstrap leads to asymptotically valid inference on the entire quantile regression process in a setting with a large number of small, heterogeneous clusters and provides consistent estimates of the asymptotic covariance function of that process. The proposed bootstrap procedure is easy to implement and performs well even when the number of clusters is much smaller than the sample size. An application to Project STAR data is provided. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 446-456 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1148610 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1148610 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:446-456 Template-Type: ReDIF-Article 1.0 Author-Name: Thomas A. Murray Author-X-Name-First: Thomas A. Author-X-Name-Last: Murray Author-Name: Peter F. Thall Author-X-Name-First: Peter F. Author-X-Name-Last: Thall Author-Name: Ying Yuan Author-X-Name-First: Ying Author-X-Name-Last: Yuan Author-Name: Sarah McAvoy Author-X-Name-First: Sarah Author-X-Name-Last: McAvoy Author-Name: Daniel R. Gomez Author-X-Name-First: Daniel R. Author-X-Name-Last: Gomez Title: Robust Treatment Comparison Based on Utilities of Semi-Competing Risks in Non-Small-Cell Lung Cancer Abstract: A design is presented for a randomized clinical trial comparing two second-line treatments, chemotherapy versus chemotherapy plus reirradiation, for treatment of recurrent non-small-cell lung cancer. The central research question is whether the potential efficacy benefit that adding reirradiation to chemotherapy may provide justifies its potential for increasing the risk of toxicity. The design uses two co-primary outcomes: time to disease progression or death, and time to severe toxicity. Because patients may be given an active third-line treatment at disease progression that confounds second-line treatment effects on toxicity and survival following disease progression, for the purpose of this comparative study follow-up ends at disease progression or death. In contrast, follow-up for disease progression or death continues after severe toxicity, so these are semi-competing risks. A conditionally conjugate Bayesian model that is robust to misspecification is formulated using piecewise exponential distributions.
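The piecewise exponential building block mentioned here takes the form $h(t) = \lambda_k$ for $t \in (a_{k-1}, a_k]$, $k = 1, \ldots, K$ (a standard construction, sketched with generic notation rather than the article's): with independent Gamma priors on the $\lambda_k$, the posterior of each $\lambda_k$ given the event counts and total exposure time in interval $k$ is again Gamma, which is the conditional conjugacy such designs exploit, while the many-interval hazard keeps the model flexible enough to be robust to misspecification.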
A numerical utility function that characterizes the desirabilities of the possible co-primary outcome realizations is elicited from the physicians. A comparative test based on posterior mean utilities is proposed. A simulation study is presented to evaluate test performance for a variety of treatment differences, and a sensitivity assessment to the elicited utility function is performed. General guidelines are given for constructing a design in similar settings, and a computer program for simulation and trial conduct is provided. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 11-23 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1176926 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1176926 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:11-23 Template-Type: ReDIF-Article 1.0 Author-Name: J. L. Scealy Author-X-Name-First: J. L. Author-X-Name-Last: Scealy Author-Name: A. H. Welsh Author-X-Name-First: A. H. Author-X-Name-Last: Welsh Title: A Directional Mixed Effects Model for Compositional Expenditure Data Abstract: Compositional data are vectors of proportions defined on the unit simplex, and this type of constrained data occurs frequently in government surveys. It is also possible for the compositional data to be correlated due to the clustering or grouping of the observations within small domains or areas. We propose a new class of mixed models for compositional data based on the Kent distribution for directional data, where the random effects also have Kent distributions. One useful property of the new directional mixed model is that the marginal mean direction has a closed form and is interpretable. The random effects enter the model in a multiplicative way via the product of a set of rotation matrices, and the conditional mean direction is a random rotation of the marginal mean direction. In small area estimation settings, the mean proportions are usually of primary interest and these are shown to be simple functions of the marginal mean direction. For estimation, we apply a quasi-likelihood method, which results in solving a new set of generalized estimating equations, and these are shown to have low bias in typical situations. For inference, we use a nonparametric bootstrap method for clustered data that does not rely on estimates of the shape parameters (shape parameters are difficult to estimate in Kent models). We analyze data from the 2009–2010 Australian Household Expenditure Survey CURF (confidentialized unit record file). We predict the proportions of total weekly expenditure on food and housing costs for households in a chosen set of domains. The new approach is shown to be more tractable than the traditional approach based on the logratio transformation. Journal: Journal of the American Statistical Association Pages: 24-36 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1189336 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1189336 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:24-36 Template-Type: ReDIF-Article 1.0 Author-Name: Bledar A. Konomi Author-X-Name-First: Bledar A.
Author-X-Name-Last: Konomi Author-Name: Georgios Karagiannis Author-X-Name-First: Georgios Author-X-Name-Last: Karagiannis Author-Name: Kevin Lai Author-X-Name-First: Kevin Author-X-Name-Last: Lai Author-Name: Guang Lin Author-X-Name-First: Guang Author-X-Name-Last: Lin Title: Bayesian Treed Calibration: An Application to Carbon Capture With AX Sorbent Abstract: In cases where field (or experimental) measurements are not available, computer models can represent real physical or engineering systems to reproduce their outcomes. They are usually calibrated in light of experimental data to create a better representation of the real system. Statistical methods, based on Gaussian processes, for calibration and prediction have been especially important when the computer models are expensive and experimental data are limited. In this article, we develop the Bayesian treed calibration (BTC) as an extension of standard Gaussian process calibration methods to deal with nonstationary computer models and/or their discrepancy from the field (or experimental) data. Our proposed method partitions both the calibration and observable input space, based on a binary tree partitioning, into subregions where existing model calibration methods can be applied to connect a computer model with the real system. The estimation of the parameters in the proposed model is carried out using Markov chain Monte Carlo (MCMC) computational techniques. Different strategies have been applied to improve mixing. We illustrate our method in two artificial examples and a real application that concerns the capture of carbon dioxide with AX amine-based sorbents. The source code and the examples analyzed in this article are available as part of the supplementary materials. Journal: Journal of the American Statistical Association Pages: 37-53 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1190279 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1190279 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:37-53 Template-Type: ReDIF-Article 1.0 Author-Name: Audrey Mauguen Author-X-Name-First: Audrey Author-X-Name-Last: Mauguen Author-Name: Emily C. Zabor Author-X-Name-First: Emily C. Author-X-Name-Last: Zabor Author-Name: Nancy E. Thomas Author-X-Name-First: Nancy E. Author-X-Name-Last: Thomas Author-Name: Marianne Berwick Author-X-Name-First: Marianne Author-X-Name-Last: Berwick Author-Name: Venkatraman E. Seshan Author-X-Name-First: Venkatraman E. Author-X-Name-Last: Seshan Author-Name: Colin B. Begg Author-X-Name-First: Colin B. Author-X-Name-Last: Begg Title: Defining Cancer Subtypes With Distinctive Etiologic Profiles: An Application to the Epidemiology of Melanoma Abstract: We showcase a novel analytic strategy to identify subtypes of cancer that possess distinctive causal factors, that is, subtypes that are “etiologically” distinct. The method involves the integrated analysis of two types of study design: an incident series of cases with double primary cancers, with detailed information on tumor characteristics that can be used to define the subtypes; and a case series of incident cases, with information on known risk factors that can be used to investigate the specific risk factors that distinguish the subtypes. The methods are applied to a rich melanoma dataset with detailed information on pathologic tumor factors, and comprehensive information on known genetic and environmental risk factors for melanoma.
Identification of the optimal subtyping solution is accomplished using a novel clustering analysis that seeks to maximize a measure that characterizes the distinctiveness of the distributions of risk factors across the subtypes and that is a function of the correlations of tumor factors in the case-specific tumor pairs. This analysis is challenged by the presence of extensive missing data. If successful, studies of this nature offer the opportunity for efficient study design to identify unknown risk factors whose effects are concentrated in defined subtypes. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 54-63 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1191499 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1191499 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:54-63 Template-Type: ReDIF-Article 1.0 Author-Name: Ian Barnett Author-X-Name-First: Ian Author-X-Name-Last: Barnett Author-Name: Rajarshi Mukherjee Author-X-Name-First: Rajarshi Author-X-Name-Last: Mukherjee Author-Name: Xihong Lin Author-X-Name-First: Xihong Author-X-Name-Last: Lin Title: The Generalized Higher Criticism for Testing SNP-Set Effects in Genetic Association Studies Abstract: It is of substantial interest to study the effects of genes, genetic pathways, and networks on the risk of complex diseases. These genetic constructs each contain multiple SNPs, which are often correlated and function jointly, and might be large in number. However, only a sparse subset of SNPs in a genetic construct is generally associated with the disease of interest. In this article, we propose the generalized higher criticism (GHC) to test for the association between an SNP set and a disease outcome. The higher criticism is a test traditionally used in high-dimensional signal detection settings when marginal test statistics are independent and the number of parameters is very large. However, these assumptions do not always hold in genetic association studies, due to linkage disequilibrium among SNPs and the finite number of SNPs in an SNP set in each genetic construct. The proposed GHC overcomes the limitations of the higher criticism by allowing for arbitrary correlation structures among the SNPs in an SNP-set, while performing accurate analytic p-value calculations for any finite number of SNPs in the SNP-set. We obtain the detection boundary of the GHC test. Using simulations, we empirically compared the power of the GHC method with that of existing SNP-set tests over a range of genetic regions with varied correlation structures and signal sparsity. We apply the proposed methods to analyze the CGEM breast cancer genome-wide association study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 64-76 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1192039 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1192039 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:64-76 Template-Type: ReDIF-Article 1.0 Author-Name: Sungmin Kim Author-X-Name-First: Sungmin Author-X-Name-Last: Kim Author-Name: Kevin Potter Author-X-Name-First: Kevin Author-X-Name-Last: Potter Author-Name: Peter F. Craigmile Author-X-Name-First: Peter F.
Author-X-Name-Last: Craigmile Author-Name: Mario Peruggia Author-X-Name-First: Mario Author-X-Name-Last: Peruggia Author-Name: Trisha Van Zandt Author-X-Name-First: Trisha Author-X-Name-Last: Van Zandt Title: A Bayesian Race Model for Recognition Memory Abstract: Many psychological models use the idea of a trace, which represents a change in a person’s cognitive state that arises as a result of processing a given stimulus. These models assume that a trace is always laid down when a stimulus is processed. In addition, some of these models explain how response times (RTs) and response accuracies arise from a process in which the different traces race against each other. In this article, we present a Bayesian hierarchical model of RT and accuracy in a difficult recognition memory experiment. The model includes a stochastic component that probabilistically determines whether a trace is laid down. The RTs and accuracies are modeled using a minimum gamma race model, with extra model components that allow for the effects of stimulus, sequential dependencies, and trend. Subject-specific effects, as well as ancillary effects due to processes such as perceptual encoding and guessing, are also captured in the hierarchy. Predictive checks show that our model fits the data well. Marginal likelihood evaluations show better predictive performance of our model compared to an approximate Weibull model. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 77-91 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1194844 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1194844 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:77-91 Template-Type: ReDIF-Article 1.0 Author-Name: Curtis B. Storlie Author-X-Name-First: Curtis B. Author-X-Name-Last: Storlie Author-Name: Brian J. Reich Author-X-Name-First: Brian J. Author-X-Name-Last: Reich Author-Name: William N. Rust Author-X-Name-First: William N. Author-X-Name-Last: Rust Author-Name: Lawrence O. Ticknor Author-X-Name-First: Lawrence O. Author-X-Name-Last: Ticknor Author-Name: Amanda M. Bonnie Author-X-Name-First: Amanda M. Author-X-Name-Last: Bonnie Author-Name: Andrew J. Montoya Author-X-Name-First: Andrew J. Author-X-Name-Last: Montoya Author-Name: Sarah E. Michalak Author-X-Name-First: Sarah E. Author-X-Name-Last: Michalak Title: Spatiotemporal Modeling of Node Temperatures in Supercomputers Abstract: Los Alamos National Laboratory is home to many large supercomputing clusters. These clusters require an enormous amount of power (∼500–2000 kW each), and most of this energy is converted into heat. Thus, cooling the components of the supercomputer becomes a critical and expensive endeavor. Recently, a project was initiated to investigate the effect that changes to the cooling system in a machine room had on three large machines that were housed there. Coupled with this goal was the aim of developing general good practice for characterizing the effect of cooling changes and monitoring machine node temperatures in this and other machine rooms. This article focuses on the statistical approach used to quantify the effect that several cooling changes to the room had on the temperatures of the individual nodes of the computers. The largest cluster in the room has 1600 nodes that run a variety of jobs during general use.
Since extreme temperatures are important, a normal distribution plus a generalized Pareto distribution for the upper tail is used to model the marginal distribution, along with a Gaussian process copula to account for spatio-temporal dependence. A Gaussian Markov random field (GMRF) model is used to model the spatial effects on the node temperatures as the cooling changes take place. This model is then used to assess the condition of the node temperatures after each change to the room. The analysis approach was used to uncover the cause of a problematic episode of overheating nodes on one of the supercomputing clusters. The same approach can easily be applied to monitor and investigate cooling systems at other data centers. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 92-108 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1195271 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1195271 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:92-108 Template-Type: ReDIF-Article 1.0 Author-Name: M. P. Wand Author-X-Name-First: M. P. Author-X-Name-Last: Wand Title: Fast Approximate Inference for Arbitrarily Large Semiparametric Regression Models via Message Passing Abstract: We show how the notion of message passing can be used to streamline the algebra and computer coding for fast approximate inference in large Bayesian semiparametric regression models. In particular, this approach is amenable to handling arbitrarily large models of particular types once a set of primitive operations is established. The approach is founded upon a message passing formulation of mean field variational Bayes that utilizes factor graph representations of statistical models. The underlying principles apply to general Bayesian hierarchical models, although we focus on semiparametric regression. The notion of factor graph fragments is introduced and is shown to facilitate compartmentalization of the required algebra and coding. The resultant algorithms have ready-to-implement closed form expressions and allow a broad class of arbitrarily large semiparametric regression models to be handled. Ongoing software projects such as Infer.NET and Stan support variational-type inference for particular model classes. This article is not concerned with software packages per se and focuses on the underlying tenets of scalable variational inference algorithms. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 137-168 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1197833 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1197833 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:137-168 Template-Type: ReDIF-Article 1.0 Author-Name: Michael W. Robbins Author-X-Name-First: Michael W. Author-X-Name-Last: Robbins Author-Name: Jessica Saunders Author-X-Name-First: Jessica Author-X-Name-Last: Saunders Author-Name: Beau Kilmer Author-X-Name-First: Beau Author-X-Name-Last: Kilmer Title: A Framework for Synthetic Control Methods With High-Dimensional, Micro-Level Data: Evaluating a Neighborhood-Specific Crime Intervention Abstract: The synthetic control method is an increasingly popular tool for analysis of program efficacy.
Here, it is applied to a neighborhood-specific crime intervention in Roanoke, VA, and several novel contributions are made to the synthetic control toolkit. We examine high-dimensional data at a granular level (the treated area has several cases, a large number of untreated comparison cases, and multiple outcome measures). Calibration is used to develop weights that exactly match the synthetic control to the treated region across several outcomes and time periods. Further, we illustrate the importance of adjusting the estimated effect of treatment for the design effect implicit within the weights. A permutation procedure is proposed wherein countless placebo areas can be constructed, enabling estimation of p-values under a robust set of assumptions. An omnibus statistic is introduced that is used to jointly test for the presence of an intervention effect across multiple outcomes and post-intervention time periods. Analyses indicate that the Roanoke crime intervention did decrease crime levels, but the estimated effect of the intervention is not as statistically significant as it would have been had less rigorous approaches been used. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 109-126 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1213634 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1213634 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:109-126 Template-Type: ReDIF-Article 1.0 Author-Name: Brenda López Cabrera Author-X-Name-First: Brenda López Author-X-Name-Last: Cabrera Author-Name: Franziska Schulz Author-X-Name-First: Franziska Author-X-Name-Last: Schulz Title: Forecasting Generalized Quantiles of Electricity Demand: A Functional Data Approach Abstract: Electricity load forecasts are an integral part of many decision-making processes in the electricity market. However, most literature on electricity load forecasting concentrates on deterministic forecasts, neglecting possibly important information about uncertainty. A more complete picture of future demand can be obtained by using distributional forecasts, allowing for more efficient decision-making. A predictive density can be fully characterized by tail measures such as quantiles and expectiles. Furthermore, interest often lies in the accurate estimation of tail events rather than in the mean or median. We propose a new methodology to obtain probabilistic forecasts of electricity load that is based on functional data analysis of generalized quantile curves. The core of the methodology is dimension reduction based on functional principal components of tail curves with dependence structure. The approach has several advantages, such as flexible inclusion of explanatory variables like meteorological forecasts and no distributional assumptions. The methodology is applied to load data from a transmission system operator (TSO) and a balancing unit in Germany. Our forecast method is evaluated against other models including the TSO forecast model. It outperforms them in terms of mean absolute percentage error and mean squared error. Supplementary materials for this article are available online. 
Journal: Journal of the American Statistical Association Pages: 127-136 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1219259 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1219259 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:127-136 Template-Type: ReDIF-Article 1.0 Author-Name: Valen E. Johnson Author-X-Name-First: Valen E. Author-X-Name-Last: Johnson Author-Name: Richard D. Payne Author-X-Name-First: Richard D. Author-X-Name-Last: Payne Author-Name: Tianying Wang Author-X-Name-First: Tianying Author-X-Name-Last: Wang Author-Name: Alex Asher Author-X-Name-First: Alex Author-X-Name-Last: Asher Author-Name: Soutrik Mandal Author-X-Name-First: Soutrik Author-X-Name-Last: Mandal Title: On the Reproducibility of Psychological Science Abstract: Investigators from a large consortium of scientists recently performed a multi-year study in which they replicated 100 psychology experiments. Although statistically significant results were reported in 97% of the original studies, statistical significance was achieved in only 36% of the replicated studies. This article presents a reanalysis of these data based on a formal statistical model that accounts for publication bias by treating outcomes from unpublished studies as missing data, while simultaneously estimating the distribution of effect sizes for those studies that tested nonnull effects. The resulting model suggests that more than 90% of tests performed in eligible psychology experiments tested negligible effects, and that publication biases based on p-values caused the observed rates of nonreproducibility. The results of this reanalysis provide a compelling argument for both increasing the threshold required for declaring scientific discoveries and for adopting statistical summaries of evidence that account for the high proportion of tested hypotheses that are false. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1-10 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1240079 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1240079 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:1-10 Template-Type: ReDIF-Article 1.0 Author-Name: Dustin Tran Author-X-Name-First: Dustin Author-X-Name-Last: Tran Author-Name: David M. Blei Author-X-Name-First: David M. Author-X-Name-Last: Blei Title: Comment Journal: Journal of the American Statistical Association Pages: 156-158 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1270044 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1270044 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:156-158 Template-Type: ReDIF-Article 1.0 Author-Name: Wanzhu Tu Author-X-Name-First: Wanzhu Author-X-Name-Last: Tu Title: Comment Journal: Journal of the American Statistical Association Pages: 158-161 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1270045 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1270045 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:158-161 Template-Type: ReDIF-Article 1.0 Author-Name: Philip T. Reiss Author-X-Name-First: Philip T. 
Author-X-Name-Last: Reiss Author-Name: Jeff Goldsmith Author-X-Name-First: Jeff Author-X-Name-Last: Goldsmith Title: Comment Journal: Journal of the American Statistical Association Pages: 161-164 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1270049 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1270049 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:161-164 Template-Type: ReDIF-Article 1.0 Author-Name: Simon N. Wood Author-X-Name-First: Simon N. Author-X-Name-Last: Wood Title: Comment Journal: Journal of the American Statistical Association Pages: 164-166 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1270050 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1270050 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:164-166 Template-Type: ReDIF-Article 1.0 Author-Name: M. P. Wand Author-X-Name-First: M. P. Author-X-Name-Last: Wand Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 166-168 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1270051 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1270051 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:166-168 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Correction Journal: Journal of the American Statistical Association Pages: 465-465 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1270057 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1270057 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:465-465 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Correction Abstract: Articles in the June 2016 issue of the Journal of the American Statistical Association unintentionally omitted some author affiliations. Following is a complete list of authors and their affiliations Journal: Journal of the American Statistical Association Pages: 466-469 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2016.1270064 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1270064 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:466-469 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Book Reviews Journal: Journal of the American Statistical Association Pages: 457-464 Issue: 517 Volume: 112 Year: 2017 Month: 1 X-DOI: 10.1080/01621459.2017.1286186 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1286186 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:457-464 Template-Type: ReDIF-Article 1.0 Author-Name: Chung Eun Lee Author-X-Name-First: Chung Eun Author-X-Name-Last: Lee Author-Name: Xiaofeng Shao Author-X-Name-First: Xiaofeng Author-X-Name-Last: Shao Title: Martingale Difference Divergence Matrix and Its Application to Dimension Reduction for Stationary Multivariate Time Series Abstract: In this article, we introduce a new methodology to perform dimension reduction for a stationary multivariate time series. 
Our method is motivated by the consideration of optimal prediction and focuses on reducing the effective dimension of the conditional mean of the time series given past information. In particular, we seek a contemporaneous linear transformation such that the transformed time series has two parts, with one part being conditionally mean independent of the past. To achieve this goal, we first propose the so-called martingale difference divergence matrix (MDDM), which can quantify the conditional mean independence of V ∈ R^p given U ∈ R^q and also encodes the number and form of linear combinations of V that are conditionally mean independent of U. Our dimension reduction procedure is based on eigen-decomposition of the cumulative martingale difference divergence matrix, which is an extension of MDDM to the time series context. Interestingly, there is a static factor model representation for our dimension reduction framework, and it has a subtle difference from the existing static factor model used in the time series literature. Some theory is also provided on the rates of convergence of the eigenvalues and eigenvectors of the sample cumulative MDDM in the fixed-dimensional setting. Favorable finite sample performance is demonstrated via simulations and real data illustrations in comparison with some existing methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 216-229 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2016.1240083 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1240083 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:216-229 Template-Type: ReDIF-Article 1.0 Author-Name: Susan Athey Author-X-Name-First: Susan Author-X-Name-Last: Athey Author-Name: Dean Eckles Author-X-Name-First: Dean Author-X-Name-Last: Eckles Author-Name: Guido W. Imbens Author-X-Name-First: Guido W. Author-X-Name-Last: Imbens Title: Exact p-Values for Network Interference Abstract: We study the calculation of exact p-values for a large class of nonsharp null hypotheses about treatment effects in a setting with data from experiments involving members of a single connected network. The class includes null hypotheses that limit the effect of one unit’s treatment status on another according to the distance between units; for example, the hypothesis might specify that the treatment status of immediate neighbors has no effect, or that units more than two edges away have no effect. We also consider hypotheses concerning the validity of sparsification of a network (e.g., based on the strength of ties) and hypotheses restricting heterogeneity in peer effects (so that, e.g., only the number or fraction treated among neighboring units matters). Our general approach is to define an artificial experiment, such that the null hypothesis that was not sharp for the original experiment is sharp for the artificial experiment, and such that the randomization analysis for the artificial experiment is validated by the design of the original experiment. Journal: Journal of the American Statistical Association Pages: 230-240 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2016.1241178 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1241178 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:230-240 Template-Type: ReDIF-Article 1.0 Author-Name: Kehui Chen Author-X-Name-First: Kehui Author-X-Name-Last: Chen Author-Name: Jing Lei Author-X-Name-First: Jing Author-X-Name-Last: Lei Title: Network Cross-Validation for Determining the Number of Communities in Network Data Abstract: The stochastic block model (SBM) and its variants have been a popular tool for analyzing large network data with community structures. In this article, we develop an efficient network cross-validation (NCV) approach to determine the number of communities, as well as to choose between the regular stochastic block model and the degree corrected block model (DCBM). The proposed NCV method is based on a block-wise node-pair splitting technique, combined with an integrated step of community recovery using sub-blocks of the adjacency matrix. We prove that the probability of under-selection vanishes as the number of nodes increases, under mild conditions satisfied by a wide range of popular community recovery algorithms. The solid performance of our method is also demonstrated in extensive simulations and two data examples. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 241-251 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2016.1246365 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1246365 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:241-251 Template-Type: ReDIF-Article 1.0 Author-Name: Fang Han Author-X-Name-First: Fang Author-X-Name-Last: Han Author-Name: Han Liu Author-X-Name-First: Han Author-X-Name-Last: Liu Title: ECA: High-Dimensional Elliptical Component Analysis in Non-Gaussian Distributions Abstract: We present a robust alternative to principal component analysis (PCA)—called elliptical component analysis (ECA)—for analyzing high-dimensional, elliptically distributed data. ECA estimates the eigenspace of the covariance matrix of the elliptical data. To cope with heavy-tailed elliptical distributions, a multivariate rank statistic is exploited. At the model level, we consider two settings: either that the leading eigenvectors of the covariance matrix are nonsparse or that they are sparse. Methodologically, we propose ECA procedures for both nonsparse and sparse settings. Theoretically, we provide both nonasymptotic and asymptotic analyses quantifying the theoretical performances of ECA. In the nonsparse setting, we show that ECA’s performance is highly related to the effective rank of the covariance matrix. In the sparse setting, the results are twofold: (i) we show that the sparse ECA estimator based on a combinatoric program attains the optimal rate of convergence; (ii) based on some recent developments in estimating sparse leading eigenvectors, we show that a computationally efficient sparse ECA estimator attains the optimal rate of convergence under a suboptimal scaling. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 252-268 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2016.1246366 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1246366 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:252-268 Template-Type: ReDIF-Article 1.0 Author-Name: Jiming Jiang Author-X-Name-First: Jiming Author-X-Name-Last: Jiang Author-Name: J. Sunil Rao Author-X-Name-First: J. Sunil Author-X-Name-Last: Rao Author-Name: Jie Fan Author-X-Name-First: Jie Author-X-Name-Last: Fan Author-Name: Thuan Nguyen Author-X-Name-First: Thuan Author-X-Name-Last: Nguyen Title: Classified Mixed Model Prediction Abstract: Many practical problems are related to prediction, where the main interest is at the subject (e.g., personalized medicine) or (small) sub-population (e.g., small community) level. In such cases, it is possible to make substantial gains in prediction accuracy by identifying a class that a new subject belongs to. This way, the new subject is potentially associated with a random effect corresponding to the same class in the training data, so that the method of mixed model prediction can be used to make the best prediction. We propose a new method, called classified mixed model prediction (CMMP), to achieve this goal. We develop CMMP for both prediction of mixed effects and prediction of future observations, and consider different scenarios where there may or may not be a “match” of the new subject among the training-data subjects. Theoretical and empirical studies are carried out to examine the properties of CMMP, including prediction intervals based on CMMP, and its comparison with existing methods. In particular, we show that, even if the actual match does not exist between the class of the new observations and those of the training data, CMMP still helps in improving prediction accuracy. Two real-data examples are considered. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 269-279 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2016.1246367 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1246367 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:269-279 Template-Type: ReDIF-Article 1.0 Author-Name: Stephen Reid Author-X-Name-First: Stephen Author-X-Name-Last: Reid Author-Name: Jonathan Taylor Author-X-Name-First: Jonathan Author-X-Name-Last: Taylor Author-Name: Robert Tibshirani Author-X-Name-First: Robert Author-X-Name-Last: Tibshirani Title: A General Framework for Estimation and Inference From Clusters of Features Abstract: Applied statistical problems often come with prespecified groupings of predictors. It is natural to test for the presence of simultaneous group-wide signal for groups in isolation, or for multiple groups together. Current tests for the presence of such signals include the classical F-test or a t-test on unsupervised group prototypes (either group centroids or first principal components). In this article, we propose test statistics that aim for power improvements over these classical approaches. In particular, we first create group prototypes, with reference to the response, and then test with likelihood ratio statistics incorporating only these prototypes. We propose a model, called the “prototype model,” which naturally models this two-step procedure. Furthermore, we introduce an inferential schema detailing the unique considerations for different combinations of prototype formation and univariate/multivariate testing models. The prototype model also suggests new applications to estimation and prediction.
Prototype formation often relies on variable selection, which invalidates classical Gaussian test theory. We use recent advances in selective inference to account for selection in the prototyping step and retain test validity. Simulation experiments suggest that our testing procedure enjoys more power than do classical approaches. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 280-293 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2016.1246368 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1246368 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:280-293 Template-Type: ReDIF-Article 1.0 Author-Name: Ling Ma Author-X-Name-First: Ling Author-X-Name-Last: Ma Author-Name: Rajeshwari Sundaram Author-X-Name-First: Rajeshwari Author-X-Name-Last: Sundaram Title: Analysis of Gap Times Based on Panel Count Data With Informative Observation Times and Unknown Start Time Abstract: In biomedical studies, one is often interested in repeat events with longitudinal observations occurring only intermittently, resulting in panel count data. The first stage of labor, measured through unit-increments of cervical dilation in pregnant women, provides such an example. Obstetricians are interested in assessing the gap time distribution of per-unit increments of cervical dilation for better management of the labor process. Typically, only intermittent medical examinations for cervical dilation occur after (already dilated) women are admitted to the hospital. The observation frequency is very likely correlated with how fast or slowly a woman dilates. Thus, one could view such data as panel count data with informative observation times and an unknown start time. Here, we propose semiparametric proportional rate models for the event process and the observation process, with a multiplicative subject-specific frailty variable capturing the correlation between the two processes. Inference procedures for the gap times between consecutive events are proposed for both known and unknown start times, using a likelihood-based approach and estimating equations. The methodology is assessed through a simulation study and through its large-sample properties. A detailed analysis using the proposed methods is performed on data from two studies: the Collaborative Perinatal Project and the Consortium on Safe Labor. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 294-305 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2016.1246369 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1246369 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:294-305 Template-Type: ReDIF-Article 1.0 Author-Name: Emilie Devijver Author-X-Name-First: Emilie Author-X-Name-Last: Devijver Author-Name: Mélina Gallopin Author-X-Name-First: Mélina Author-X-Name-Last: Gallopin Title: Block-Diagonal Covariance Selection for High-Dimensional Gaussian Graphical Models Abstract: Gaussian graphical models are widely used to infer and visualize networks of dependencies between continuous variables. However, inferring the graph is difficult when the sample size is small compared to the number of variables.
To reduce the number of parameters to estimate in the model, we propose a nonasymptotic model selection procedure supported by strong theoretical guarantees based on an oracle-type inequality and a minimax lower bound. The covariance matrix of the model is approximated by a block-diagonal matrix. The structure of this matrix is detected by thresholding the sample covariance matrix, where the threshold is selected using the slope heuristic. Based on the block-diagonal structure of the covariance matrix, the estimation problem is divided into several independent problems: subsequently, the network of dependencies between variables is inferred using the graphical lasso algorithm in each block. The performance of the procedure is illustrated on simulated data. An application to a real gene expression dataset with a limited sample size is also presented: the dimension reduction allows attention to be objectively focused on interactions among smaller subsets of genes, leading to a more parsimonious and interpretable modular network. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 306-314 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2016.1247002 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1247002 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:306-314 Template-Type: ReDIF-Article 1.0 Author-Name: Zhao Chen Author-X-Name-First: Zhao Author-X-Name-Last: Chen Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Title: Error Variance Estimation in Ultrahigh-Dimensional Additive Models Abstract: Error variance estimation plays an important role in statistical inference for high-dimensional regression models. This article concerns error variance estimation in high-dimensional sparse additive models. We study the asymptotic behavior of the traditional mean squared error, the naive estimate of the error variance, and show that it may significantly underestimate the error variance due to spurious correlations that are even higher in nonparametric models than in linear models. We further propose an accurate estimate for the error variance in ultrahigh-dimensional sparse additive models by effectively integrating sure independence screening and refitted cross-validation techniques. The root-n consistency and the asymptotic normality of the resulting estimate are established. We conduct a Monte Carlo simulation study to examine the finite sample performance of the newly proposed estimate. A real data example is used to illustrate the proposed methodology. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 315-327 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2016.1251440 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1251440 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:315-327 Template-Type: ReDIF-Article 1.0 Author-Name: Yin Xia Author-X-Name-First: Yin Author-X-Name-Last: Xia Author-Name: Tianxi Cai Author-X-Name-First: Tianxi Author-X-Name-Last: Cai Author-Name: T. Tony Cai Author-X-Name-First: T.
Tony Author-X-Name-Last: Cai Title: Multiple Testing of Submatrices of a Precision Matrix With Applications to Identification of Between Pathway Interactions Abstract: Making accurate inference for gene regulatory networks, including inference about pathway-by-pathway interactions, is an important and difficult task. Motivated by such genomic applications, we consider multiple testing for conditional dependence between subgroups of variables. Under a Gaussian graphical model framework, the problem is translated into simultaneous testing for a collection of submatrices of a high-dimensional precision matrix with each submatrix summarizing the dependence structure between two subgroups of variables. A novel multiple testing procedure is proposed and both theoretical and numerical properties of the procedure are investigated. The asymptotic null distribution of the test statistic for an individual hypothesis is established, and the proposed multiple testing procedure is shown to asymptotically control the false discovery rate (FDR) and false discovery proportion (FDP) at the prespecified level under regularity conditions. Simulations show that the procedure works well in controlling the FDR and has good power in detecting the true interactions. The procedure is applied to a breast cancer gene expression study to identify between pathway interactions. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 328-339 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2016.1251930 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1251930 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:328-339 Template-Type: ReDIF-Article 1.0 Author-Name: Jeffrey W. Miller Author-X-Name-First: Jeffrey W. Author-X-Name-Last: Miller Author-Name: Matthew T. Harrison Author-X-Name-First: Matthew T. Author-X-Name-Last: Harrison Title: Mixture Models With a Prior on the Number of Components Abstract: A natural Bayesian approach for mixture models with an unknown number of components is to take the usual finite mixture model with symmetric Dirichlet weights, and put a prior on the number of components—that is, to use a mixture of finite mixtures (MFM). The most commonly used method of inference for MFMs is reversible jump Markov chain Monte Carlo, but it can be nontrivial to design good reversible jump moves, especially in high-dimensional spaces. Meanwhile, there are samplers for Dirichlet process mixture (DPM) models that are relatively simple and are easily adapted to new applications. It turns out that, in fact, many of the essential properties of DPMs are also exhibited by MFMs—an exchangeable partition distribution, restaurant process, random measure representation, and stick-breaking representation—and crucially, the MFM analogues are simple enough that they can be used much like the corresponding DPM properties. Consequently, many of the powerful methods developed for inference in DPMs can be directly applied to MFMs as well; this simplifies the implementation of MFMs and can substantially improve mixing. We illustrate with real and simulated data, including high-dimensional gene expression data used to discriminate cancer subtypes. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 340-356 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2016.1255636 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1255636 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:340-356 Template-Type: ReDIF-Article 1.0 Author-Name: Shengchun Kong Author-X-Name-First: Shengchun Author-X-Name-Last: Kong Author-Name: Bin Nan Author-X-Name-First: Bin Author-X-Name-Last: Nan Author-Name: John D. Kalbfleisch Author-X-Name-First: John D. Author-X-Name-Last: Kalbfleisch Author-Name: Rajiv Saran Author-X-Name-First: Rajiv Author-X-Name-Last: Saran Author-Name: Richard Hirth Author-X-Name-First: Richard Author-X-Name-Last: Hirth Title: Conditional Modeling of Longitudinal Data With Terminal Event Abstract: We consider a random effects model for longitudinal data with the occurrence of an informative terminal event that is subject to right censoring. Existing methods for analyzing such data include the joint modeling approach using latent frailty and the marginal estimating equation approach using inverse probability weighting; in both cases the effect of the terminal event on the response variable is not explicit and thus not easily interpreted. In contrast, we treat the terminal event time as a covariate in a conditional model for the longitudinal data, which provides a straightforward interpretation while keeping the usual relationship of interest between the longitudinally measured response variable and covariates for times that are far from the terminal event. A two-stage semiparametric likelihood-based approach is proposed for estimating the regression parameters; first, the conditional distribution of the right-censored terminal event time given other covariates is estimated and then the likelihood function for the longitudinal event given the terminal event and other regression parameters is maximized. The method is illustrated by numerical simulations and by analyzing medical cost data for patients with end-stage renal disease. Desirable asymptotic properties are provided. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 357-368 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2016.1255637 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1255637 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:357-368 Template-Type: ReDIF-Article 1.0 Author-Name: BaoLuo Sun Author-X-Name-First: BaoLuo Author-X-Name-Last: Sun Author-Name: Eric J. Tchetgen Tchetgen Author-X-Name-First: Eric J. Author-X-Name-Last: Tchetgen Tchetgen Title: On Inverse Probability Weighting for Nonmonotone Missing at Random Data Abstract: The development of coherent missing data models to account for nonmonotone missing at random (MAR) data by inverse probability weighting (IPW) remains to date largely unresolved. As a consequence, IPW has essentially been restricted for use only in monotone MAR settings. We propose a class of models for nonmonotone missing data mechanisms that spans the MAR model, while allowing the underlying full data law to remain unrestricted. For parametric specifications within the proposed class, we introduce an unconstrained maximum likelihood estimator for estimating the missing data probabilities which is easily implemented using existing software. 
To circumvent potential convergence issues with this procedure, we also introduce a constrained Bayesian approach to estimate the missing data process which is guaranteed to yield inferences that respect all model restrictions. The efficiency of standard IPW estimation is improved by incorporating information from incomplete cases through an augmented estimating equation which is optimal within a large class of estimating equations. We investigate the finite-sample properties of the proposed estimators in extensive simulations and illustrate the new methodology in an application evaluating key correlates of preterm delivery for infants born to HIV-infected mothers in Botswana, Africa. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 369-379 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2016.1256814 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1256814 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:369-379 Template-Type: ReDIF-Article 1.0 Author-Name: Quefeng Li Author-X-Name-First: Quefeng Author-X-Name-Last: Li Author-Name: Guang Cheng Author-X-Name-First: Guang Author-X-Name-Last: Cheng Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Yuyan Wang Author-X-Name-First: Yuyan Author-X-Name-Last: Wang Title: Embracing the Blessing of Dimensionality in Factor Models Abstract: Factor modeling is an essential tool for exploring intrinsic dependence structures among high-dimensional random variables. Much progress has been made for estimating the covariance matrix from a high-dimensional factor model. However, the blessing of dimensionality has not yet been fully embraced in the literature: much of the available data are often ignored in constructing covariance matrix estimates. If our goal is to accurately estimate a covariance matrix of a set of targeted variables, shall we employ additional data, which are beyond the variables of interest, in the estimation? In this article, we provide sufficient conditions for an affirmative answer, and further quantify its gain in terms of Fisher information and convergence rate. In fact, even an oracle-like result (as if all the factors were known) can be achieved when a sufficiently large number of variables is used. The idea of using data as much as possible brings computational challenges. A divide-and-conquer algorithm is thus proposed to alleviate the computational burden, and also shown not to sacrifice any statistical accuracy in comparison with a pooled analysis. Simulation studies further confirm our advocacy for the use of full data, and demonstrate the effectiveness of the above algorithm. Our proposal is applied to a microarray data example that shows empirical benefits of using more data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 380-389 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2016.1256815 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1256815 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:380-389 Template-Type: ReDIF-Article 1.0 Author-Name: Fan Li Author-X-Name-First: Fan Author-X-Name-Last: Li Author-Name: Kari Lock Morgan Author-X-Name-First: Kari Lock Author-X-Name-Last: Morgan Author-Name: Alan M. 
Zaslavsky Author-X-Name-First: Alan M. Author-X-Name-Last: Zaslavsky Title: Balancing Covariates via Propensity Score Weighting Abstract: Covariate balance is crucial for unconfounded descriptive or causal comparisons. However, lack of balance is common in observational studies. This article considers weighting strategies for balancing covariates. We define a general class of weights—the balancing weights—that balance the weighted distributions of the covariates between treatment groups. These weights incorporate the propensity score to weight each group to an analyst-selected target population. This class unifies existing weighting methods, including commonly used weights such as inverse-probability weights as special cases. General large-sample results on nonparametric estimation based on these weights are derived. We further propose a new weighting scheme, the overlap weights, in which each unit’s weight is proportional to the probability of that unit being assigned to the opposite group. The overlap weights are bounded, and minimize the asymptotic variance of the weighted average treatment effect among the class of balancing weights. The overlap weights also possess a desirable small-sample exact balance property, based on which we propose a new method that achieves exact balance for means of any selected set of covariates. Two applications illustrate these methods and compare them with other approaches. Journal: Journal of the American Statistical Association Pages: 390-400 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2016.1260466 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1260466 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:390-400 Template-Type: ReDIF-Article 1.0 Author-Name: Abhra Sarkar Author-X-Name-First: Abhra Author-X-Name-Last: Sarkar Author-Name: Debdeep Pati Author-X-Name-First: Debdeep Author-X-Name-Last: Pati Author-Name: Antik Chakraborty Author-X-Name-First: Antik Author-X-Name-Last: Chakraborty Author-Name: Bani K. Mallick Author-X-Name-First: Bani K. Author-X-Name-Last: Mallick Author-Name: Raymond J. Carroll Author-X-Name-First: Raymond J. Author-X-Name-Last: Carroll Title: Bayesian Semiparametric Multivariate Density Deconvolution Abstract: We consider the problem of multivariate density deconvolution when interest lies in estimating the distribution of a vector valued random variable X but precise measurements on X are not available, observations being contaminated by measurement errors U. The existing sparse literature on the problem assumes the density of the measurement errors to be completely known. We propose robust Bayesian semiparametric multivariate deconvolution approaches when the measurement error density of U is not known but replicated proxies are available for at least some individuals. Additionally, we allow the variability of U to depend on the associated unobserved values of X through unknown relationships, which also automatically includes the case of multivariate multiplicative measurement errors. Basic properties of finite mixture models, multivariate normal kernels, and exchangeable priors are exploited in novel ways to meet modeling and computational challenges. Theoretical results showing the flexibility of the proposed methods in capturing a wide variety of data-generating processes are provided. We illustrate the efficiency of the proposed methods in recovering the density of X through simulation experiments. 
The methodology is applied to estimate the joint consumption pattern of different dietary components from contaminated 24 h recalls. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 401-416 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2016.1260467 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1260467 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:401-416 Template-Type: ReDIF-Article 1.0 Author-Name: Rajesh Ranganath Author-X-Name-First: Rajesh Author-X-Name-Last: Ranganath Author-Name: David M. Blei Author-X-Name-First: David M. Author-X-Name-Last: Blei Title: Correlated Random Measures Abstract: We develop correlated random measures, random measures where the atom weights can exhibit a flexible pattern of dependence, and use them to develop powerful hierarchical Bayesian nonparametric models. Hierarchical Bayesian nonparametric models are usually built from completely random measures, a Poisson-process-based construction in which the atom weights are independent. Completely random measures imply strong independence assumptions in the corresponding hierarchical model, and these assumptions are often misplaced in real-world settings. Correlated random measures address this limitation. They model correlation within the measure by using a Gaussian process in concert with the Poisson process. With correlated random measures, for example, we can develop a latent feature model for which we can infer both the properties of the latent features and their dependency pattern. We develop several other examples as well. We study a correlated random measure model of pairwise count data. We derive an efficient variational inference algorithm and show improved predictive performance on large datasets of documents, web clicks, and electronic health records. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 417-430 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2016.1260468 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1260468 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:417-430 Template-Type: ReDIF-Article 1.0 Author-Name: Veronika Ročková Author-X-Name-First: Veronika Author-X-Name-Last: Ročková Author-Name: Edward I. George Author-X-Name-First: Edward I. Author-X-Name-Last: George Title: The Spike-and-Slab LASSO Abstract: Despite the wide adoption of spike-and-slab methodology for Bayesian variable selection, its potential for penalized likelihood estimation has largely been overlooked. In this article, we bridge this gap by cross-fertilizing these two paradigms with the Spike-and-Slab LASSO procedure for variable selection and parameter estimation in linear regression. We introduce a new class of self-adaptive penalty functions that arise from a fully Bayes spike-and-slab formulation, ultimately moving beyond the separable penalty framework. A virtue of these nonseparable penalties is their ability to borrow strength across coordinates, adapt to ensemble sparsity information and exert multiplicity adjustment. The Spike-and-Slab LASSO procedure harvests efficient coordinate-wise implementations with a path-following scheme for dynamic posterior exploration. 
We show on simulated data that the fully Bayes penalty mimics oracle performance, providing a viable alternative to cross-validation. We develop theory for the separable and nonseparable variants of the penalty, showing rate-optimality of the global mode as well as optimal posterior concentration when p > n. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 431-444 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2016.1260469 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1260469 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:431-444 Template-Type: ReDIF-Article 1.0 Author-Name: Yiyuan She Author-X-Name-First: Yiyuan Author-X-Name-Last: She Author-Name: Zhifeng Wang Author-X-Name-First: Zhifeng Author-X-Name-Last: Wang Author-Name: He Jiang Author-X-Name-First: He Author-X-Name-Last: Jiang Title: Group Regularized Estimation Under Structural Hierarchy Abstract: Variable selection for models including interactions between explanatory variables often needs to obey certain hierarchical constraints. Weak or strong structural hierarchy requires that the existence of an interaction term imply the presence of at least one or both associated main effects in the model. Lately, this problem has attracted a lot of attention, but existing computational algorithms converge slowly even with a moderate number of predictors. Moreover, in contrast to the rich literature on ordinary variable selection, there is a lack of statistical theory to show reasonably low error rates of hierarchical variable selection. This work investigates a new class of estimators that make use of multiple group penalties to capture structural parsimony. We show that the proposed estimators enjoy sharp rate oracle inequalities, and give the minimax lower bounds in strong and weak hierarchical variable selection. A general-purpose algorithm is developed with guaranteed convergence and global optimality. Simulations and real data experiments demonstrate the efficiency and efficacy of the proposed approach. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 445-454 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2016.1260470 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1260470 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:445-454 Template-Type: ReDIF-Article 1.0 Author-Name: Marco Battiston Author-X-Name-First: Marco Author-X-Name-Last: Battiston Author-Name: Stefano Favaro Author-X-Name-First: Stefano Author-X-Name-Last: Favaro Author-Name: Yee Whye Teh Author-X-Name-First: Yee Whye Author-X-Name-Last: Teh Title: Multi-Armed Bandit for Species Discovery: A Bayesian Nonparametric Approach Abstract: Let (P1, …, PJ) denote J populations of animals from distinct regions. A priori, it is unknown which species are present in each region and what their corresponding frequencies are. Species are shared among populations and each species can be present in more than one region with its frequency varying across populations. In this article, we consider the problem of sequentially sampling these populations to observe the greatest number of different species.
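In the notation commonly used for spike-and-slab Laplace mixtures (illustrative here, not copied from the article), the prior behind the Spike-and-Slab LASSO penalty above can be written as \pi(\beta_j \mid \theta) = \theta \, \tfrac{\lambda_1}{2} e^{-\lambda_1 |\beta_j|} + (1-\theta) \, \tfrac{\lambda_0}{2} e^{-\lambda_0 |\beta_j|}, with \lambda_0 \gg \lambda_1, so the spike component shrinks negligible coefficients aggressively while the slab leaves large coefficients nearly unpenalized; placing a fully Bayes prior on \theta and taking the negative log of the marginal prior yields the nonseparable, self-adaptive penalty the abstract describes.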
We adopt a Bayesian nonparametric approach and endow (P1, …, PJ) with a hierarchical Pitman–Yor process prior. As a consequence of the hierarchical structure, the J unknown discrete probability measures share the same support, that of their common random base measure. Given this prior choice, we propose a sequential rule that, at every time step, given the information available up to that point, selects the population from which to collect the next observation. Rather than picking the population with the highest posterior estimate of producing a new value, the proposed rule includes a Thompson sampling step to better balance the exploration–exploitation trade-off. We also propose an extension of the algorithm to deal with incidence data, where multiple observations are collected in a time period. The performance of the proposed algorithms is assessed through a simulation study and compared to three other strategies. Finally, we compare these algorithms using a dataset of species of trees, collected from different plots in South America. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 455-466 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2016.1261711 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1261711 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:455-466 Template-Type: ReDIF-Article 1.0 Author-Name: Pavel Krupskii Author-X-Name-First: Pavel Author-X-Name-Last: Krupskii Author-Name: Raphaël Huser Author-X-Name-First: Raphaël Author-X-Name-Last: Huser Author-Name: Marc G. Genton Author-X-Name-First: Marc G. Author-X-Name-Last: Genton Title: Factor Copula Models for Replicated Spatial Data Abstract: We propose a new copula model that can be used with replicated spatial data. Unlike the multivariate normal copula, the proposed copula is based on the assumption that a common factor exists and affects the joint dependence of all measurements of the process. Moreover, the proposed copula can model tail dependence and tail asymmetry. The model is parameterized in terms of a covariance function that may be chosen from the many models proposed in the literature, such as the Matérn model. For some choice of common factors, the joint copula density is given in closed form and therefore likelihood estimation is very fast. In the general case, one-dimensional numerical integration is needed to calculate the likelihood, but estimation is still reasonably fast even with large datasets. We use simulation studies to show the wide range of dependence structures that can be generated by the proposed model with different choices of common factors. We apply the proposed model to spatial temperature data and compare its performance with some popular geostatistics models. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 467-479 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2016.1261712 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1261712 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
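The common-factor construction in the factor copula abstract above is easy to mimic in the Gaussian one-factor case; in this sketch (loadings and sample size are illustrative) each margin loads on a single shared factor and is then mapped to the uniform scale:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, rho = 1000, np.array([0.8, 0.6, 0.4])   # loadings on the common factor
v = rng.normal(size=(n, 1))                # one common factor per replicate
eps = rng.normal(size=(n, rho.size))
z = rho * v + np.sqrt(1 - rho**2) * eps    # conditionally independent given v
u = norm.cdf(z)                            # copula data on the uniform scale
print(np.corrcoef(u, rowvar=False).round(2))

The Gaussian factor gives no tail dependence; as the abstract notes, other choices of common factor produce tail dependence and tail asymmetry.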
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:467-479 Template-Type: ReDIF-Article 1.0 Author-Name: Yuanjia Wang Author-X-Name-First: Yuanjia Author-X-Name-Last: Wang Author-Name: Haoda Fu Author-X-Name-First: Haoda Author-X-Name-Last: Fu Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Title: Learning Optimal Personalized Treatment Rules in Consideration of Benefit and Risk: With an Application to Treating Type 2 Diabetes Patients With Insulin Therapies Abstract: Individualized medical decision making is often complex due to patient treatment response heterogeneity. Pharmacotherapy may exhibit distinct efficacy and safety profiles for different patient populations. An “optimal” treatment that maximizes clinical benefit for a patient may also raise safety concerns due to a high risk of adverse events. Thus, to guide individualized clinical decision making and deliver optimal tailored treatments, maximizing clinical benefit should be considered in the context of controlling for potential risk. In this work, we propose two approaches to identify a personalized optimal treatment strategy that maximizes clinical benefit under a constraint on the average risk. We derive the theoretical optimal treatment rule under the risk constraint and draw an analogy to the Neyman–Pearson lemma to prove the theorem. We present algorithms that can be easily implemented by any off-the-shelf quadratic programming package. We conduct extensive simulation studies to show satisfactory risk control when maximizing the clinical benefit. Finally, we apply our method to a randomized trial of type 2 diabetes patients to guide optimal utilization of the first-line insulin treatments based on individual patient characteristics while controlling for the rate of hypoglycemia events. We identify baseline glycated hemoglobin level, body mass index, and fasting blood glucose as three key factors among 18 biomarkers to differentiate treatment assignments, and demonstrate a successful control of the risk of hypoglycemia in both the training and testing datasets. Journal: Journal of the American Statistical Association Pages: 1-13 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2017.1303386 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1303386 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:1-13 Template-Type: ReDIF-Article 1.0 Author-Name: Danielle Braun Author-X-Name-First: Danielle Author-X-Name-Last: Braun Author-Name: Malka Gorfine Author-X-Name-First: Malka Author-X-Name-Last: Gorfine Author-Name: Hormuzd A. Katki Author-X-Name-First: Hormuzd A. Author-X-Name-Last: Katki Author-Name: Argyrios Ziogas Author-X-Name-First: Argyrios Author-X-Name-Last: Ziogas Author-Name: Giovanni Parmigiani Author-X-Name-First: Giovanni Author-X-Name-Last: Parmigiani Title: Nonparametric Adjustment for Measurement Error in Time-to-Event Data: Application to Risk Prediction Models Abstract: Mismeasured time-to-event data used as a predictor in risk prediction models will lead to inaccurate predictions. This arises in the context of self-reported family history, a time-to-event predictor often measured with error, used in Mendelian risk prediction models. Using validation data, we propose a method to adjust for this type of error. We estimate the measurement error process using a nonparametric smoothed Kaplan–Meier estimator, and use Monte Carlo integration to implement the adjustment.
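Since the smoothed Kaplan–Meier estimator is the nonparametric building block here, a minimal (unsmoothed) version may fix ideas; the data below are hypothetical (time, event) pairs, and the sketch is not the authors' implementation:

import numpy as np

def kaplan_meier(times, events):
    """Survival estimates at event times for right-censored data."""
    order = np.lexsort((1 - events, times))   # at ties, events precede censorings
    times, events = times[order], events[order]
    at_risk, surv, out = len(times), 1.0, []
    for t, d in zip(times, events):
        if d == 1:                             # event observed at time t
            surv *= 1.0 - 1.0 / at_risk
            out.append((t, surv))
        at_risk -= 1                           # any observation leaves the risk set
    return out

print(kaplan_meier(np.array([2.0, 3.0, 3.0, 5.0, 8.0]), np.array([1, 1, 0, 1, 0])))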
We apply our method to simulated data in the context of both Mendelian and multivariate survival prediction models. Simulations are evaluated using measures of mean squared error of prediction (MSEP), area under the receiver operating characteristic curve (ROC-AUC), and the ratio of observed to expected number of events. These results show that our method mitigates the effects of measurement error mainly by improving calibration and total accuracy. We illustrate our method in the context of Mendelian risk prediction models focusing on misreporting of breast cancer, fitting the measurement error model on data from the University of California at Irvine, and applying our method to counselees from the Cancer Genetics Network. We show that our method improves overall calibration, especially in low-risk deciles. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 14-25 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2017.1311261 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1311261 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:14-25 Template-Type: ReDIF-Article 1.0 Author-Name: Jouni Kuha Author-X-Name-First: Jouni Author-X-Name-Last: Kuha Author-Name: Sarah Butt Author-X-Name-First: Sarah Author-X-Name-Last: Butt Author-Name: Myrsini Katsikatsou Author-X-Name-First: Myrsini Author-X-Name-Last: Katsikatsou Author-Name: Chris J. Skinner Author-X-Name-First: Chris J. Author-X-Name-Last: Skinner Title: The Effect of Probing “Don’t Know” Responses on Measurement Quality and Nonresponse in Surveys Abstract: In survey interviews, “Don’t know” (DK) responses are commonly treated as missing data. One way to reduce the rate of such responses is to probe initial DK answers with a follow-up question designed to encourage respondents to give substantive, non-DK responses. However, such probing can also reduce data quality by introducing additional or differential measurement error. We propose a latent variable model for analyzing the effects of probing on responses to survey questions. The model makes it possible to separate measurement effects of probing from true differences between respondents who do and do not require probing. We analyze new data from an experiment, which compared responses to two multi-item batteries of questions with and without probing. In this study, probing reduced the rate of DK responses by around a half. However, it also had substantial measurement effects, in that probed answers were often weaker measures of constructs of interest than were unprobed answers. These effects were larger for questions on attitudes than for pseudo-knowledge questions on perceptions of external facts. The results provide evidence against the use of probing of “Don’t know” responses, at least for the kinds of items and respondents considered in this study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 26-40 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2017.1323640 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1323640 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:26-40 Template-Type: ReDIF-Article 1.0 Author-Name: Guillaume Basse Author-X-Name-First: Guillaume Author-X-Name-Last: Basse Author-Name: Avi Feller Author-X-Name-First: Avi Author-X-Name-Last: Feller Title: Analyzing Two-Stage Experiments in the Presence of Interference Abstract: Two-stage randomization is a powerful design for estimating treatment effects in the presence of interference; that is, when one individual’s treatment assignment affects another individual’s outcomes. Our motivating example is a two-stage randomized trial evaluating an intervention to reduce student absenteeism in the School District of Philadelphia. In that experiment, households with multiple students were first assigned to treatment or control; then, in treated households, one student was randomly assigned to treatment. Using this example, we highlight key considerations for analyzing two-stage experiments in practice. Our first contribution is to address additional complexities that arise when household sizes vary; in this case, researchers must decide between assigning equal weight to households or equal weight to individuals. We propose unbiased estimators for a broad class of individual- and household-weighted estimands, with corresponding theoretical and estimated variances. Our second contribution is to connect two common approaches for analyzing two-stage designs: linear regression and randomization inference. We show that, with suitably chosen standard errors, these two approaches yield identical point and variance estimates, which is somewhat surprising given the complex randomization scheme. Finally, we explore options for incorporating covariates to improve precision. We confirm our analytic results via simulation studies and apply these methods to the attendance study, finding substantively meaningful spillover effects. Journal: Journal of the American Statistical Association Pages: 41-55 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2017.1323641 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1323641 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:41-55 Template-Type: ReDIF-Article 1.0 Author-Name: Maria DeYoreo Author-X-Name-First: Maria Author-X-Name-Last: DeYoreo Author-Name: Athanasios Kottas Author-X-Name-First: Athanasios Author-X-Name-Last: Kottas Title: Modeling for Dynamic Ordinal Regression Relationships: An Application to Estimating Maturity of Rockfish in California Abstract: We develop a Bayesian nonparametric framework for modeling ordinal regression relationships, which evolve in discrete time. The motivating application involves a key problem in fisheries research on estimating dynamically evolving relationships between age, length, and maturity, the latter recorded on an ordinal scale. The methodology builds from nonparametric mixture modeling for the joint stochastic mechanism of covariates and latent continuous responses. This approach yields highly flexible inference for ordinal regression functions while at the same time avoiding the computational challenges of parametric models that arise from estimation of cut-off points relating the latent continuous and ordinal responses. A novel dependent Dirichlet process prior for time-dependent mixing distributions extends the model to the dynamic setting.
The methodology is used for a detailed study of relationships between maturity, age, and length for Chilipepper rockfish, using data collected over 15 years along the coast of California. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 68-80 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2017.1328357 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1328357 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:68-80 Template-Type: ReDIF-Article 1.0 Author-Name: Siamak Zamani Dadaneh Author-X-Name-First: Siamak Zamani Author-X-Name-Last: Dadaneh Author-Name: Xiaoning Qian Author-X-Name-First: Xiaoning Author-X-Name-Last: Qian Author-Name: Mingyuan Zhou Author-X-Name-First: Mingyuan Author-X-Name-Last: Zhou Title: BNP-Seq: Bayesian Nonparametric Differential Expression Analysis of Sequencing Count Data Abstract: We perform differential expression analysis of high-throughput sequencing count data under a Bayesian nonparametric framework, removing sophisticated ad hoc pre-processing steps commonly required in existing algorithms. We propose to use the gamma (beta) negative binomial process, which takes into account different sequencing depths using sample-specific negative binomial probability (dispersion) parameters, to detect differentially expressed genes by comparing the posterior distributions of gene-specific negative binomial dispersion (probability) parameters. These model parameters are inferred by borrowing statistical strength across both the genes and samples. Extensive experiments on both simulated and real-world RNA sequencing count data show that the proposed differential expression analysis algorithms clearly outperform previously proposed ones in terms of the areas under both the receiver operating characteristic and precision-recall curves. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 81-94 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2017.1328358 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1328358 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:81-94 Template-Type: ReDIF-Article 1.0 Author-Name: Meredith L. Wallace Author-X-Name-First: Meredith L. Author-X-Name-Last: Wallace Author-Name: Daniel J. Buysse Author-X-Name-First: Daniel J. Author-X-Name-Last: Buysse Author-Name: Anne Germain Author-X-Name-First: Anne Author-X-Name-Last: Germain Author-Name: Martica H. Hall Author-X-Name-First: Martica H. Author-X-Name-Last: Hall Author-Name: Satish Iyengar Author-X-Name-First: Satish Author-X-Name-Last: Iyengar Title: Variable Selection for Skewed Model-Based Clustering: Application to the Identification of Novel Sleep Phenotypes Abstract: In sleep research, applying finite mixture models to sleep characteristics captured through multiple data types, including self-reported sleep diary, a wrist monitor capturing movement (actigraphy), and brain waves (polysomnography), may suggest new phenotypes that reflect underlying disease mechanisms. However, a direct mixture model application is challenging because there are many sleep variables from which to choose, and sleep variables are often highly skewed even in homogeneous samples.
Moreover, previous sleep research findings indicate that some of the most clinically interesting solutions will be those that incorporate all three data types. Thus, we present two novel skewed variable selection algorithms based on the multivariate skew normal (MSN) distribution: one that selects the best set of variables ignoring data type and another that embraces the exploratory nature of clustering and suggests multiple statistically plausible sets of variables that each incorporate all data types. Through a simulation study, we empirically compare our approach with other asymmetric and normal dimension reduction strategies for clustering. Finally, we demonstrate our methods using a sample of older adults with and without insomnia. The proposed MSN-based variable selection algorithm appears to be suitable for both MSN and multivariate normal cluster distributions, especially with moderate to large sample sizes. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 95-110 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2017.1330202 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1330202 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:95-110 Template-Type: ReDIF-Article 1.0 Author-Name: Ross P. Hilton Author-X-Name-First: Ross P. Author-X-Name-Last: Hilton Author-Name: Yuchen Zheng Author-X-Name-First: Yuchen Author-X-Name-Last: Zheng Author-Name: Nicoleta Serban Author-X-Name-First: Nicoleta Author-X-Name-Last: Serban Title: Modeling Heterogeneity in Healthcare Utilization Using Massive Medical Claims Data Abstract: We introduce a modeling approach for characterizing heterogeneity in healthcare utilization using massive medical claims data. We first translate the medical claims observed for a large study population and across five years into individual-level discrete events of care called utilization sequences. We model the utilization sequences using an exponential proportional hazards mixture model to capture heterogeneous behaviors in patients’ healthcare utilization. The objective is to cluster patients according to their longitudinal utilization behaviors and to determine the main drivers of variation in healthcare utilization while controlling for the demographic, geographic, and health characteristics of the patients. Due to the computational infeasibility of fitting a parametric proportional hazards model for high-dimensional, large-sample-size data, we use an iterative one-step procedure to estimate the model parameters and impute the cluster membership. The approach is used to draw inferences on utilization behaviors of children in the Medicaid system with persistent asthma across six states. We conclude with policy implications for targeted interventions to improve adherence to recommended care practices for pediatric asthma. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 111-121 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2017.1330203 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1330203 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:111-121 Template-Type: ReDIF-Article 1.0 Author-Name: Peng Shi Author-X-Name-First: Peng Author-X-Name-Last: Shi Author-Name: Lu Yang Author-X-Name-First: Lu Author-X-Name-Last: Yang Title: Pair Copula Constructions for Insurance Experience Rating Abstract: In nonlife insurance, insurers use experience rating to adjust premiums to reflect policyholders’ previous claim experience. Performing prospective experience rating can be challenging when the claim distribution is complex. For instance, insurance claims are semicontinuous in that a fraction of zeros is often associated with an otherwise positive continuous outcome from a right-skewed and long-tailed distribution. Practitioners use a credibility premium, which is a special form of shrinkage estimator in the longitudinal data framework. However, the linear predictor is not informative, especially when the outcome follows a mixed distribution. In this article, we introduce a mixed vine pair copula construction framework for modeling semicontinuous longitudinal claims. In the proposed framework, a two-component mixture regression is employed to accommodate the zero inflation and thick tails in the claim distribution. The temporal dependence among repeated observations is modeled using a sequence of bivariate conditional copulas based on a mixed D-vine. We emphasize that the resulting predictive distribution allows insurers to incorporate past experience into future premiums in a nonlinear fashion and the classic linear predictor can be viewed as a nested case. In the application, we examine a unique claims dataset of government property insurance from the state of Wisconsin. Due to the discrepancies between the claim and premium distributions, we employ an ordered Lorenz curve to evaluate the predictive performance. We show that the proposed approach offers substantial opportunities for separating risks and identifying profitable business when compared with alternative experience rating methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 122-133 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2017.1330692 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1330692 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:122-133 Template-Type: ReDIF-Article 1.0 Author-Name: Ryan Warnick Author-X-Name-First: Ryan Author-X-Name-Last: Warnick Author-Name: Michele Guindani Author-X-Name-First: Michele Author-X-Name-Last: Guindani Author-Name: Erik Erhardt Author-X-Name-First: Erik Author-X-Name-Last: Erhardt Author-Name: Elena Allen Author-X-Name-First: Elena Author-X-Name-Last: Allen Author-Name: Vince Calhoun Author-X-Name-First: Vince Author-X-Name-Last: Calhoun Author-Name: Marina Vannucci Author-X-Name-First: Marina Author-X-Name-Last: Vannucci Title: A Bayesian Approach for Estimating Dynamic Functional Network Connectivity in fMRI Data Abstract: Dynamic functional connectivity, that is, the study of how interactions among brain regions change dynamically over the course of an fMRI experiment, has recently received wide interest in the neuroimaging literature. Current approaches for studying dynamic connectivity often rely on ad hoc approaches for inference, with the fMRI time courses segmented by a sequence of sliding windows.
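For readers of the pair copula abstract above, the conditioning step that propagates dependence through a D-vine is the so-called h-function; the bivariate Gaussian case is shown for concreteness (a generic sketch, not the authors' mixed-vine code):

from scipy.stats import norm

def h_gauss(u, v, rho):
    """Conditional copula C(u | v) for the bivariate Gaussian copula."""
    x, y = norm.ppf(u), norm.ppf(v)
    return norm.cdf((x - rho * y) / (1 - rho**2) ** 0.5)

print(h_gauss(0.9, 0.5, 0.7))   # P(U1 <= 0.9 | U2 = 0.5) when rho = 0.7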
We propose a principled Bayesian approach to dynamic functional connectivity, which is based on the estimation of time-varying networks. Our method utilizes a hidden Markov model for classification of latent cognitive states, achieving estimation of the networks in an integrated framework that borrows strength over the entire time course of the experiment. Furthermore, we assume that the graph structures, which define the connectivity states at each time point, are related within a super-graph, to encourage the selection of the same edges among related graphs. We apply our method to simulated task-based fMRI data, where we show how our approach allows the decoupling of the task-related activations and the functional connectivity states. We also analyze data from an fMRI sensorimotor task experiment on an individual healthy subject and obtain results that support the role of particular anatomical regions in modulating interaction between executive control and attention networks. Journal: Journal of the American Statistical Association Pages: 134-151 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2017.1379404 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1379404 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:134-151 Template-Type: ReDIF-Article 1.0 Author-Name: John C. Duchi Author-X-Name-First: John C. Author-X-Name-Last: Duchi Author-Name: Michael I. Jordan Author-X-Name-First: Michael I. Author-X-Name-Last: Jordan Author-Name: Martin J. Wainwright Author-X-Name-First: Martin J. Author-X-Name-Last: Wainwright Title: Minimax Optimal Procedures for Locally Private Estimation Abstract: Working under a model of privacy in which data remain private even from the statistician, we study the tradeoff between privacy guarantees and the risk of the resulting statistical estimators. We develop private versions of classical information-theoretic bounds, in particular those due to Le Cam, Fano, and Assouad. These inequalities allow for a precise characterization of statistical rates under local privacy constraints and the development of provably (minimax) optimal estimation procedures. We provide a treatment of several canonical families of problems: mean estimation and median estimation, generalized linear models, and nonparametric density estimation. For all of these families, we provide lower and upper bounds that match up to constant factors, and exhibit new (optimal) privacy-preserving mechanisms and computationally efficient estimators that achieve the bounds. Additionally, we present a variety of experimental results for estimation problems involving sensitive data, including salaries, censored blog posts and articles, and drug abuse; these experiments demonstrate the importance of deriving optimal procedures. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 182-201 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2017.1389735 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1389735 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
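The local privacy model above can be made concrete with classical randomized response, the simplest locally private estimator of a proportion (a textbook sketch, not one of the article's optimal mechanisms):

import numpy as np

rng = np.random.default_rng(2)
eps = 1.0                                        # privacy budget
p = np.exp(eps) / (1 + np.exp(eps))              # probability of truthful report

x = rng.binomial(1, 0.3, size=10000)             # sensitive bits, true mean 0.3
z = np.where(rng.random(x.size) < p, x, 1 - x)   # privatized reports

# Debias using E[z] = (2p - 1) * mu + (1 - p)
mu_hat = (z.mean() - (1 - p)) / (2 * p - 1)
print(f"locally private mean estimate: {mu_hat:.3f}")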
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:182-201 Template-Type: ReDIF-Article 1.0 Author-Name: Joseph Guinness Author-X-Name-First: Joseph Author-X-Name-Last: Guinness Author-Name: Dorit Hammerling Author-X-Name-First: Dorit Author-X-Name-Last: Hammerling Title: Compression and Conditional Emulation of Climate Model Output Abstract: Numerical climate model simulations run at high spatial and temporal resolutions generate massive quantities of data. As our computing capabilities continue to increase, storing all of the data is not sustainable, and thus it is important to develop methods for representing the full datasets by smaller compressed versions. We propose a statistical compression and decompression algorithm based on storing a set of summary statistics as well as a statistical model describing the conditional distribution of the full dataset given the summary statistics. We decompress the data by computing conditional expectations and conditional simulations from the model given the summary statistics. Conditional expectations represent our best estimate of the original data but are subject to oversmoothing in space and time. Conditional simulations introduce realistic small-scale noise so that the decompressed fields are neither too smooth nor too rough compared with the original data. Considerable attention is paid to accurately modeling the original dataset—1 year of daily mean temperature data—particularly with regard to the inherent spatial nonstationarity in global fields, and to determining the statistics to be stored, so that the variation in the original data can be closely captured, while allowing for fast decompression and conditional emulation on modest computers. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 56-67 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2017.1395339 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1395339 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:56-67 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Corrigendum Journal: Journal of the American Statistical Association Pages: 486-486 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2017.1395340 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1395340 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:486-486 Template-Type: ReDIF-Article 1.0 Author-Name: Petra M. Kuhnert Author-X-Name-First: Petra M. Author-X-Name-Last: Kuhnert Title: Comment Journal: Journal of the American Statistical Association Pages: 168-170 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2017.1415904 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1415904 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:168-170 Template-Type: ReDIF-Article 1.0 Author-Name: William F. Christensen Author-X-Name-First: William F. Author-X-Name-Last: Christensen Author-Name: C. Shane Reese Author-X-Name-First: C. 
Shane Author-X-Name-Last: Reese Title: Comment Journal: Journal of the American Statistical Association Pages: 171-173 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2017.1415905 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1415905 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:171-173 Template-Type: ReDIF-Article 1.0 Author-Name: Noel Cressie Author-X-Name-First: Noel Author-X-Name-Last: Cressie Title: Mission CO2ntrol: A Statistical Scientist's Role in Remote Sensing of Atmospheric Carbon Dioxide Abstract: Too much carbon dioxide (CO2) in the atmosphere is a threat to long-term sustainability of Earth's ecosystem. Atmospheric CO2 is a leading greenhouse gas that has increased to levels not seen since the middle Pliocene (approximately 3.6 million years ago). One of the US National Aeronautics and Space Administration's (NASA) remote sensing missions is the Orbiting Carbon Observatory-2, whose principal science objective is to estimate the global geographic distribution of CO2 sources and sinks at Earth's surface, through time. This starts with raw radiances (Level 1), moves on to retrievals of the atmospheric state (Level 2), from which maps of gap-filled and de-noised geophysical variables and their uncertainties are made (Level 3). With the aid of a model of transport in the atmosphere, CO2 fluxes (Level 4) can be obtained from Level 2 data directly or possibly through Level 3. Decisions about how to mitigate or manage CO2 could be thought of as Level 5. Hierarchical statistical modeling is used to qualify and quantify the uncertainties at each level. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 152-168 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2017.1419136 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1419136 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:152-168 Template-Type: ReDIF-Article 1.0 Author-Name: Marc G. Genton Author-X-Name-First: Marc G. Author-X-Name-Last: Genton Author-Name: Jaehong Jeong Author-X-Name-First: Jaehong Author-X-Name-Last: Jeong Title: Comment Journal: Journal of the American Statistical Association Pages: 176-178 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2017.1419137 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1419137 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:176-178 Template-Type: ReDIF-Article 1.0 Author-Name: Frédéric Chevallier Author-X-Name-First: Frédéric Author-X-Name-Last: Chevallier Author-Name: François-Marie Bréon Author-X-Name-First: François-Marie Author-X-Name-Last: Bréon Title: Comment Abstract: Based on the measurements of the OCO-2 satellite, Noel Cressie addresses a particularly hard challenge for Earth observation, arguably an extreme case in remote sensing. He is one of the very few who have expertise in most of the processing chain and his article brilliantly discusses the diverse underlying statistical challenges. In this comment, we provide a complementary view of the topic to qualify its prospects as drawn by N. Cressie at the end of his article. We first summarize the motivation of OCO-2-type programs; we then expose the corresponding challenges before discussing the prospects.
Journal: Journal of the American Statistical Association Pages: 173-175 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2017.1419138 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1419138 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:173-175 Template-Type: ReDIF-Article 1.0 Author-Name: Noel Cressie Author-X-Name-First: Noel Author-X-Name-Last: Cressie Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 178-181 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2017.1421541 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1421541 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:178-181 Template-Type: ReDIF-Article 1.0 Author-Name: Anderson Y. Zhang Author-X-Name-First: Anderson Y. Author-X-Name-Last: Zhang Author-Name: Harrison H. Zhou Author-X-Name-First: Harrison H. Author-X-Name-Last: Zhou Title: Comment Journal: Journal of the American Statistical Association Pages: 201-203 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2018.1442605 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1442605 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:201-203 Template-Type: ReDIF-Article 1.0 Author-Name: Alfred Hero Author-X-Name-First: Alfred Author-X-Name-Last: Hero Title: Comment Journal: Journal of the American Statistical Association Pages: 203-204 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2018.1442606 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1442606 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:203-204 Template-Type: ReDIF-Article 1.0 Author-Name: Vishesh Karwa Author-X-Name-First: Vishesh Author-X-Name-Last: Karwa Title: Comment Journal: Journal of the American Statistical Association Pages: 204-207 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2018.1442607 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1442607 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:204-207 Template-Type: ReDIF-Article 1.0 Author-Name: Moritz Hardt Author-X-Name-First: Moritz Author-X-Name-Last: Hardt Title: Comment Journal: Journal of the American Statistical Association Pages: 207-208 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2018.1442608 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1442608 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:207-208 Template-Type: ReDIF-Article 1.0 Author-Name: Aaron Roth Author-X-Name-First: Aaron Author-X-Name-Last: Roth Title: Comment Journal: Journal of the American Statistical Association Pages: 208-211 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2018.1442610 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1442610 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:208-211 Template-Type: ReDIF-Article 1.0 Author-Name: John C. Duchi Author-X-Name-First: John C. Author-X-Name-Last: Duchi Author-Name: Michael I. Jordan Author-X-Name-First: Michael I. 
Author-X-Name-Last: Jordan Author-Name: Martin J. Wainwright Author-X-Name-First: Martin J. Author-X-Name-Last: Wainwright Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 212-215 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2018.1442611 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1442611 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:212-215 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Erratum Journal: Journal of the American Statistical Association Pages: 487-487 Issue: 521 Volume: 113 Year: 2018 Month: 1 X-DOI: 10.1080/01621459.2018.1460558 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1460558 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:487-487 Template-Type: ReDIF-Article 1.0 Author-Name: Edoardo M. Airoldi Author-X-Name-First: Edoardo M. Author-X-Name-Last: Airoldi Author-Name: Jonathan M. Bischof Author-X-Name-First: Jonathan M. Author-X-Name-Last: Bischof Title: Improving and Evaluating Topic Models and Other Models of Text Abstract: An ongoing challenge in the analysis of document collections is how to summarize content in terms of a set of inferred themes that can be interpreted substantively in terms of topics. The current practice of parameterizing the themes in terms of most frequent words limits interpretability by ignoring the differential use of words across topics. Here, we show that words that are both frequent and exclusive to a theme are more effective at characterizing topical content, and we propose a regularization scheme that leads to better estimates of these quantities. We consider a supervised setting where professional editors have annotated documents to topic categories, organized into a tree, in which leaf-nodes correspond to more specific topics. Each document is annotated to multiple categories, at different levels of the tree. We introduce a hierarchical Poisson convolution model to analyze these annotated documents. A parallelized Hamiltonian Monte Carlo sampler allows the inference to scale to millions of documents. The model leverages the structure among categories defined by professional editors to infer a clear semantic description for each topic in terms of words that are both frequent and exclusive. In this supervised setting, we validate the efficacy of word frequency and exclusivity at characterizing topical content on two very large collections of documents, from Reuters and the New York Times. In an unsupervised setting, we then consider a simplified version of the model that shares the same regularization scheme with the previous model. We carry out a large randomized experiment on Amazon Mechanical Turk to demonstrate that topic summaries based on frequency and exclusivity, estimated using the proposed regularization scheme, are more interpretable than currently established frequency-based summaries, and that the proposed model produces more efficient estimates of exclusivity than the currently established models. Journal: Journal of the American Statistical Association Pages: 1381-1403 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1051182 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1051182 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1381-1403 Template-Type: ReDIF-Article 1.0 Author-Name: Xinlei Wang Author-X-Name-First: Xinlei Author-X-Name-Last: Wang Author-Name: Johan Lim Author-X-Name-First: Johan Author-X-Name-Last: Lim Author-Name: Lynne Stokes Author-X-Name-First: Lynne Author-X-Name-Last: Stokes Title: Using Ranked Set Sampling With Cluster Randomized Designs for Improved Inference on Treatment Effects Abstract: This article examines the use of ranked set sampling (RSS) with cluster randomized designs (CRDs), for potential improvement in estimation and detection of treatment or intervention effects. Outcome data in cluster randomized studies typically have nested structures, where hierarchical linear models (HLMs) become a natural choice for data analysis. However, nearly all theoretical developments in RSS to date are within the structure of one-level models. Thus, implementation of RSS at one or more levels of an HLM will require development of new theory and methods. Under RSS-structured CRDs developed to incorporate RSS at different levels, a nonparametric estimator of the treatment effect is proposed; and its theoretical properties are studied under a general HLM that has almost no distributional assumptions. We formally quantify the magnitude of the improvement from using RSS over SRS (simple random sampling), investigate the relationship between design parameters and relative efficiency, and establish connections with one-level RSS under completely balanced CRDs, as well as studying the impact of clustering and imperfect ranking. Further, based on the proposed RSS estimator, a new test is constructed to detect treatment effects, which is distribution-free and easy to use. Simulation studies confirm that in general, the proposed test is more powerful than the conventional F-test for the original CRDs, especially for small or medium effect sizes. Two empirical studies, one using data from educational research (i.e., the motivating application) and the other using human dental data, show that our methods work well in real-world settings and our theory provides useful predictions at the stage of experimental design, and that substantial gains may be obtained from using RSS at either level. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1576-1590 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1093946 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1093946 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1576-1590 Template-Type: ReDIF-Article 1.0 Author-Name: Patrick R. Conrad Author-X-Name-First: Patrick R. Author-X-Name-Last: Conrad Author-Name: Youssef M. Marzouk Author-X-Name-First: Youssef M. Author-X-Name-Last: Marzouk Author-Name: Natesh S. Pillai Author-X-Name-First: Natesh S. Author-X-Name-Last: Pillai Author-Name: Aaron Smith Author-X-Name-First: Aaron Author-X-Name-Last: Smith Title: Accelerating Asymptotically Exact MCMC for Computationally Intensive Models via Local Approximations Abstract: We construct a new framework for accelerating Markov chain Monte Carlo in posterior sampling problems where standard methods are limited by the computational cost of the likelihood, or of numerical models embedded therein. 
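The efficiency gain from ranked set sampling in the abstract above can be seen in a few lines; this sketch assumes perfect ranking, a standard normal population, and one-level balanced RSS (all choices illustrative, not the authors' design):

import numpy as np

rng = np.random.default_rng(3)
k, cycles, reps = 3, 20, 2000   # set size, cycles per sample, Monte Carlo reps

def rss_mean():
    # for each cycle and each rank r, draw a set of k units, rank them,
    # and keep only the r-th order statistic
    obs = [np.sort(rng.normal(size=k))[r] for _ in range(cycles) for r in range(k)]
    return np.mean(obs)

rss = np.array([rss_mean() for _ in range(reps)])
srs = np.array([rng.normal(size=k * cycles).mean() for _ in range(reps)])
print(f"var of RSS mean: {rss.var():.5f}, var of SRS mean: {srs.var():.5f}")

Both estimators are unbiased for the population mean, but the stratification by rank makes the RSS mean visibly less variable at the same total sample size.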
Our approach introduces local approximations of these models into the Metropolis–Hastings kernel, borrowing ideas from deterministic approximation theory, optimization, and experimental design. Previous efforts at integrating approximate models into inference typically sacrifice either the sampler’s exactness or efficiency; our work seeks to address these limitations by exploiting useful convergence characteristics of local approximations. We prove the ergodicity of our approximate Markov chain, showing that it samples asymptotically from the exact posterior distribution of interest. We describe variations of the algorithm that employ either local polynomial approximations or local Gaussian process regressors. Our theoretical results reinforce the key observation underlying this article: when the likelihood has some local regularity, the number of model evaluations per Markov chain Monte Carlo (MCMC) step can be greatly reduced without biasing the Monte Carlo average. Numerical experiments demonstrate multiple order-of-magnitude reductions in the number of forward model evaluations used in representative ordinary differential equation (ODE) and partial differential equation (PDE) inference problems, with both synthetic and real data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1591-1607 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1096787 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1096787 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1591-1607 Template-Type: ReDIF-Article 1.0 Author-Name: Veronika Ročková Author-X-Name-First: Veronika Author-X-Name-Last: Ročková Author-Name: Edward I. George Author-X-Name-First: Edward I. Author-X-Name-Last: George Title: Fast Bayesian Factor Analysis via Automatic Rotations to Sparsity Abstract: Rotational post hoc transformations have traditionally played a key role in enhancing the interpretability of factor analysis. Regularization methods also serve to achieve this goal by prioritizing sparse loading matrices. In this work, we bridge these two paradigms with a unifying Bayesian framework. Our approach deploys intermediate factor rotations throughout the learning process, greatly enhancing the effectiveness of sparsity inducing priors. These automatic rotations to sparsity are embedded within a PXL-EM algorithm, a Bayesian variant of parameter-expanded EM for posterior mode detection. By iterating between soft-thresholding of small factor loadings and transformations of the factor basis, we obtain (a) dramatic accelerations, (b) robustness against poor initializations, and (c) better oriented sparse solutions. To avoid the prespecification of the factor cardinality, we extend the loading matrix to have infinitely many columns with the Indian buffet process (IBP) prior. The factor dimensionality is learned from the posterior, which is shown to concentrate on sparse matrices. Our deployment of PXL-EM performs a dynamic posterior exploration, outputting a solution path indexed by a sequence of spike-and-slab priors. For accurate recovery of the factor loadings, we deploy the spike-and-slab LASSO prior, a two-component refinement of the Laplace prior. A companion criterion, motivated as an integral lower bound, is provided to effectively select the best recovery. 
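As context for the MCMC abstract above, the unaccelerated baseline is the plain random-walk Metropolis–Hastings loop below, in which every step pays for a full log-posterior evaluation; the article's contribution is to replace such calls with refining local approximations while keeping the chain asymptotically exact (the Gaussian target here is a stand-in for an expensive ODE- or PDE-based posterior):

import numpy as np

rng = np.random.default_rng(4)

def log_post(theta):
    # stand-in for an expensive model-based log posterior
    return -0.5 * np.sum(theta**2)

theta, lp, samples = np.zeros(2), log_post(np.zeros(2)), []
for _ in range(5000):
    prop = theta + 0.5 * rng.normal(size=2)
    lp_prop = log_post(prop)                 # the costly call to be approximated
    if np.log(rng.random()) < lp_prop - lp:  # Metropolis accept/reject
        theta, lp = prop, lp_prop
    samples.append(theta)
print(np.mean(samples, axis=0))              # posterior mean, near zero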
The potential of the proposed procedure is demonstrated on both simulated and real high-dimensional data, which would render posterior simulation impractical. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1608-1622 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1100620 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1100620 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1608-1622 Template-Type: ReDIF-Article 1.0 Author-Name: Ville A. Satopää Author-X-Name-First: Ville A. Author-X-Name-Last: Satopää Author-Name: Robin Pemantle Author-X-Name-First: Robin Author-X-Name-Last: Pemantle Author-Name: Lyle H. Ungar Author-X-Name-First: Lyle H. Author-X-Name-Last: Ungar Title: Modeling Probability Forecasts via Information Diversity Abstract: Randomness in scientific estimation is generally assumed to arise from unmeasured or uncontrolled factors. However, when combining subjective probability estimates, heterogeneity stemming from people’s cognitive or information diversity is often more important than measurement noise. This article presents a novel framework that uses partially overlapping information sources. A specific model is proposed within that framework and applied to the task of aggregating the probabilities given by a group of forecasters who predict whether an event will occur or not. Our model describes the distribution of information across forecasters in terms of easily interpretable parameters and shows how the optimal amount of extremizing of the average probability forecast (shifting it closer to its nearest extreme) varies as a function of the forecasters’ information overlap. Our model thus gives a more principled understanding of the historically ad hoc practice of extremizing average forecasts. Supplementary material for this article is available online. Journal: Journal of the American Statistical Association Pages: 1623-1633 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1100621 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1100621 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1623-1633 Template-Type: ReDIF-Article 1.0 Author-Name: Pramita Bagchi Author-X-Name-First: Pramita Author-X-Name-Last: Bagchi Author-Name: Moulinath Banerjee Author-X-Name-First: Moulinath Author-X-Name-Last: Banerjee Author-Name: Stilian A. Stoev Author-X-Name-First: Stilian A. Author-X-Name-Last: Stoev Title: Inference for Monotone Functions Under Short- and Long-Range Dependence: Confidence Intervals and New Universal Limits Abstract: We introduce new point-wise confidence interval estimates for monotone functions observed with additive, dependent noise. Our methodology applies to both short- and long-range dependence regimes for the errors. The interval estimates are obtained via the method of inversion of certain discrepancy statistics. This approach avoids the estimation of nuisance parameters such as the derivative of the unknown function, which previous methods are forced to deal with. The resulting estimates are therefore more accurate, stable, and widely applicable in practice under minimal assumptions on the trend and error structure. 
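The extremizing recipe that the information-diversity abstract above puts on a model-based footing is commonly implemented on the logit scale; in this sketch the extremizing constant a is illustrative, whereas the article ties the optimal amount to the forecasters' information overlap:

import numpy as np

def extremize(probs, a=2.0):
    """Average forecasts on the logit scale, then push toward the extremes."""
    logit = np.log(probs / (1 - probs))
    return 1 / (1 + np.exp(-a * logit.mean()))

print(extremize(np.array([0.6, 0.7, 0.65])))   # exceeds the plain average 0.65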
The dependence of the errors, especially long-range dependence, leads to new phenomena, where new universal limits based on convex minorant functionals of drifted fractional Brownian motion emerge. Some extensions to uniform confidence bands are also developed. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1634-1647 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1100622 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1100622 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1634-1647 Template-Type: ReDIF-Article 1.0 Author-Name: Pietro Coretto Author-X-Name-First: Pietro Author-X-Name-Last: Coretto Author-Name: Christian Hennig Author-X-Name-First: Christian Author-X-Name-Last: Hennig Title: Robust Improper Maximum Likelihood: Tuning, Computation, and a Comparison With Other Methods for Robust Gaussian Clustering Abstract: The two main topics of this article are the introduction of the “optimally tuned robust improper maximum likelihood estimator” (OTRIMLE) for robust clustering based on the multivariate Gaussian model for clusters, and a comprehensive simulation study comparing the OTRIMLE to maximum likelihood in Gaussian mixtures with and without a noise component, mixtures of t-distributions, and the TCLUST approach for trimmed clustering. The OTRIMLE uses an improper constant density for modeling outliers and noise. This can be chosen optimally so that the nonnoise part of the data looks as close to a Gaussian mixture as possible. Some deviation from Gaussianity can be traded in for lowering the estimated noise proportion. Covariance matrix constraints and computation of the OTRIMLE are also treated. In the simulation study, all methods are confronted with setups in which their model assumptions are not exactly fulfilled, and to evaluate the experiments in a standardized way by misclassification rates, a new model-based definition of “true clusters” is introduced that deviates from the usual identification of mixture components with clusters. In the study, every method turns out to be superior for one or more setups, but the OTRIMLE achieves the most satisfactory overall performance. The methods are also applied to two real datasets, one without and one with known “true” clusters. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1648-1659 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1100996 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1100996 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1648-1659 Template-Type: ReDIF-Article 1.0 Author-Name: Rebecca C. Steorts Author-X-Name-First: Rebecca C. Author-X-Name-Last: Steorts Author-Name: Rob Hall Author-X-Name-First: Rob Author-X-Name-Last: Hall Author-Name: Stephen E. Fienberg Author-X-Name-First: Stephen E. Author-X-Name-Last: Fienberg Title: A Bayesian Approach to Graphical Record Linkage and Deduplication Abstract: We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files.
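The point estimation problem behind the monotone-function abstract above is classically solved by the pool-adjacent-violators algorithm (PAVA); the compact sketch below gives the nondecreasing least-squares fit, whereas the article's intervals instead invert discrepancy statistics:

import numpy as np

def pava(y):
    """Nondecreasing least-squares fit to y (equal weights)."""
    level, weight = list(map(float, y)), [1.0] * len(y)
    i = 0
    while i < len(level) - 1:
        if level[i] > level[i + 1]:          # violation: pool adjacent blocks
            w = weight[i] + weight[i + 1]
            level[i] = (weight[i] * level[i] + weight[i + 1] * level[i + 1]) / w
            weight[i] = w
            del level[i + 1], weight[i + 1]
            i = max(i - 1, 0)                # pooling may create a new violation
        else:
            i += 1
    return np.repeat(level, [int(w) for w in weight])

print(pava(np.array([1.0, 3.0, 2.0, 4.0])))  # -> [1.  2.5 2.5 4. ]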
Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture–recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previous record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household Income and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1660-1672 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1105807 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1105807 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1660-1672 Template-Type: ReDIF-Article 1.0 Author-Name: Wang Miao Author-X-Name-First: Wang Author-X-Name-Last: Miao Author-Name: Peng Ding Author-X-Name-First: Peng Author-X-Name-Last: Ding Author-Name: Zhi Geng Author-X-Name-First: Zhi Author-X-Name-Last: Geng Title: Identifiability of Normal and Normal Mixture Models with Nonignorable Missing Data Abstract: Missing data problems arise in many applied research studies. They may jeopardize statistical inference of the model of interest, if the missing mechanism is nonignorable, that is, the missing mechanism depends on the missing values themselves even conditional on the observed data. With a nonignorable missing mechanism, the model of interest is often not identifiable without imposing further assumptions. We find that even if the missing mechanism has a known parametric form, the model is not identifiable without specifying a parametric outcome distribution. Although it is fundamental for valid statistical inference, identifiability under nonignorable missing mechanisms is not established for many commonly used models. In this article, we first demonstrate identifiability of the normal distribution under monotone missing mechanisms. We then extend it to the normal mixture and t mixture models with nonmonotone missing mechanisms. We discover that models under the Logistic missing mechanism are less identifiable than those under the Probit missing mechanism. We give necessary and sufficient conditions for identifiability of models under the Logistic missing mechanism, which sometimes can be checked in real data analysis. We illustrate our methods using a series of simulations, and apply them to a real-life dataset. Supplementary materials for this article are available online.
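A minimal simulation sketch of the nonignorable (self-masking) missing mechanism studied above may help fix ideas; the logistic link and its coefficients are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
y = rng.normal(size=n)
# The probability that y is missing depends on y itself (self-masking),
# here through a logistic link with arbitrary illustrative coefficients.
p_missing = 1.0 / (1.0 + np.exp(-(-1.0 + 1.5 * y)))
observed = rng.uniform(size=n) > p_missing

# Large values of y are more likely to be missing, so the observed-data
# mean is biased downward; identifiability results are what make valid
# correction possible.
print(y.mean(), y[observed].mean())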
Journal: Journal of the American Statistical Association Pages: 1673-1683 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1105808 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1105808 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1673-1683 Template-Type: ReDIF-Article 1.0 Author-Name: Valentin Patilea Author-X-Name-First: Valentin Author-X-Name-Last: Patilea Author-Name: César Sánchez-Sellero Author-X-Name-First: César Author-X-Name-Last: Sánchez-Sellero Author-Name: Matthieu Saumard Author-X-Name-First: Matthieu Author-X-Name-Last: Saumard Title: Testing the Predictor Effect on a Functional Response Abstract: This article examines the problem of nonparametric testing for the no-effect of a random covariate (or predictor) on a functional response. This means testing whether the conditional expectation of the response given the covariate is almost surely zero or not, without imposing any model relating response and covariate. The covariate could be univariate, multivariate, or functional. Our test statistic is a quadratic form involving univariate nearest neighbor smoothing and the asymptotic critical values are given by the standard normal law. When the covariate is multidimensional or functional, a preliminary dimension reduction device is used, which allows the effect of the covariate to be summarized into a univariate random quantity. The test is able to detect not only linear but also nonparametric alternatives. The responses could have conditional variance of unknown form and the law of the covariate does not need to be known. An empirical study with simulated and real data shows that the test performs well in applications. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1684-1695 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1110031 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1110031 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1684-1695 Template-Type: ReDIF-Article 1.0 Author-Name: Bing-Yi Jing Author-X-Name-First: Bing-Yi Author-X-Name-Last: Jing Author-Name: Zhouping Li Author-X-Name-First: Zhouping Author-X-Name-Last: Li Author-Name: Guangming Pan Author-X-Name-First: Guangming Author-X-Name-Last: Pan Author-Name: Wang Zhou Author-X-Name-First: Wang Author-X-Name-Last: Zhou Title: On SURE-Type Double Shrinkage Estimation Abstract: The article is concerned with empirical Bayes shrinkage estimators for the heteroscedastic hierarchical normal model using Stein's unbiased estimate of risk (SURE). Recently, Xie, Kou, and Brown proposed a class of estimators for this type of problem and established their asymptotic optimality properties under the assumption of known but unequal variances. In this article, we consider this problem with unequal and unknown variances, which may be more appropriate in real situations. By placing priors for both means and variances, we propose novel SURE-type double shrinkage estimators that shrink both means and variances. Optimal properties for these estimators are derived under certain regularity conditions. Extensive simulation studies are conducted to compare the newly developed methods with other shrinkage techniques. Finally, the methods are applied to the well-known baseball dataset and a gene expression dataset.
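The SURE idea underlying the double shrinkage estimators above can be sketched in the simpler single-shrinkage setting with known, unequal variances (the setting of Xie, Kou, and Brown); all constants below are illustrative, and the article's estimators additionally shrink the variances.

import numpy as np

def sure(lam, x, a, mu):
    # Stein's unbiased risk estimate, treating mu as fixed, for the
    # estimator theta_hat_i = lam/(lam + a_i) x_i + a_i/(lam + a_i) mu
    # with known variances a_i.
    w = a / (a + lam)
    return np.mean(a + w**2 * (x - mu) ** 2 - 2 * a * w)

rng = np.random.default_rng(1)
theta = rng.normal(size=500)
a = rng.uniform(0.5, 2.0, size=500)            # known, unequal variances
x = theta + rng.normal(size=500) * np.sqrt(a)

mu = x.mean()
grid = np.linspace(0.01, 10.0, 400)
lam = grid[np.argmin([sure(g, x, a, mu) for g in grid])]
theta_hat = lam / (a + lam) * x + a / (a + lam) * mu
print(lam, np.mean((theta_hat - theta) ** 2))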
Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1696-1704 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1110032 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1110032 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1696-1704 Template-Type: ReDIF-Article 1.0 Author-Name: Naveen N. Narisetty Author-X-Name-First: Naveen N. Author-X-Name-Last: Narisetty Author-Name: Vijayan N. Nair Author-X-Name-First: Vijayan N. Author-X-Name-Last: Nair Title: Extremal Depth for Functional Data and Applications Abstract: We propose a new notion called “extremal depth” (ED) for functional data, discuss its properties, and compare its performance with existing concepts. The proposed notion is based on a measure of extreme “outlyingness.” ED has several desirable properties that are not shared by other notions and is especially well suited for obtaining central regions of functional data and function spaces. In particular: (a) the central region achieves the nominal (desired) simultaneous coverage probability; (b) there is a correspondence between ED-based (simultaneous) central regions and appropriate pointwise central regions; and (c) the method is resistant to certain classes of functional outliers. The article examines the performance of ED and compares it with other depth notions. Its usefulness is demonstrated through applications to constructing central regions, functional boxplots, outlier detection, and simultaneous confidence bands in regression problems. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1705-1714 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1110033 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1110033 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1705-1714 Template-Type: ReDIF-Article 1.0 Author-Name: Pier Luigi Conti Author-X-Name-First: Pier Luigi Author-X-Name-Last: Conti Author-Name: Daniela Marella Author-X-Name-First: Daniela Author-X-Name-Last: Marella Author-Name: Mauro Scanu Author-X-Name-First: Mauro Author-X-Name-Last: Scanu Title: Statistical Matching Analysis for Complex Survey Data With Applications Abstract: The goal of statistical matching is the estimation of a joint distribution having observed only samples from its marginals. The lack of joint observations on the variables of interest is the source of uncertainty about the joint population distribution function. In the present article, the notion of matching error is introduced, and upper-bounded via an appropriate measure of uncertainty. Then, an estimate of the distribution function for the variables not jointly observed is constructed on the basis of a modification of the conditional independence assumption in the presence of logical constraints. The corresponding measure of uncertainty is estimated via sample data. Finally, a simulation study is performed, and an application to a real case is provided. Supplementary materials for this article are available online.
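The conditional independence assumption that serves as the baseline for statistical matching admits a short numerical sketch: with X and Y never jointly observed but each observed together with Z, the joint law is reconstructed as P(x, y) = sum_z P(x | z) P(y | z) P(z). All distributions below are made-up illustration values.

import numpy as np

p_z = np.array([0.4, 0.6])              # P(Z)
p_x_given_z = np.array([[0.7, 0.3],     # rows index z, columns index x
                        [0.2, 0.8]])
p_y_given_z = np.array([[0.9, 0.1],     # rows index z, columns index y
                        [0.4, 0.6]])

# Reconstruct P(x, y) under conditional independence given Z.
p_xy = np.einsum('z,zx,zy->xy', p_z, p_x_given_z, p_y_given_z)
print(p_xy, p_xy.sum())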
Journal: Journal of the American Statistical Association Pages: 1715-1725 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1112803 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1112803 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1715-1725 Template-Type: ReDIF-Article 1.0 Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Lingzhou Xue Author-X-Name-First: Lingzhou Author-X-Name-Last: Xue Author-Name: Hui Zou Author-X-Name-First: Hui Author-X-Name-Last: Zou Title: Multitask Quantile Regression Under the Transnormal Model Abstract: We consider estimating multitask quantile regression under the transnormal model, with a focus on the high-dimensional setting. We derive a surprisingly simple closed-form solution through rank-based covariance regularization. In particular, we propose the rank-based ℓ1 penalization with positive-definite constraints for estimating sparse covariance matrices, and the rank-based banded Cholesky decomposition regularization for estimating banded precision matrices. By taking advantage of the alternating direction method of multipliers, a nearest correlation matrix projection is introduced that inherits the sampling properties of the unprojected one. Our work combines strengths of quantile regression and rank-based covariance regularization to simultaneously deal with nonlinearity and nonnormality for high-dimensional regression. Furthermore, the proposed method strikes a good balance between robustness and efficiency, achieves the “oracle”-like convergence rate, and provides provable prediction intervals in the high-dimensional setting. The finite-sample performance of the proposed method is also examined. The performance of our proposed rank-based method is demonstrated in a real application to analyze the protein mass spectroscopy data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1726-1735 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1113973 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1113973 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1726-1735 Template-Type: ReDIF-Article 1.0 Author-Name: J. D. Godolphin Author-X-Name-First: J. D. Author-X-Name-Last: Godolphin Title: A Link Between the E-Value and the Robustness of Block Designs Abstract: This article investigates the robustness of binary incomplete block designs against giving rise to a disconnected design in the event of observation loss. A link is established between the E-value of a planned design and the extent of observation loss that can be experienced while still guaranteeing an eventual design from which all treatment contrasts can be estimated. Patterns of missing observations covered include loss of entire blocks and loss of individual observations. Simple bounds are provided enabling practitioners to easily assess the robustness of a planned design. Journal: Journal of the American Statistical Association Pages: 1736-1745 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1114949 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1114949 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1736-1745 Template-Type: ReDIF-Article 1.0 Author-Name: Botond Cseke Author-X-Name-First: Botond Author-X-Name-Last: Cseke Author-Name: Andrew Zammit-Mangion Author-X-Name-First: Andrew Author-X-Name-Last: Zammit-Mangion Author-Name: Tom Heskes Author-X-Name-First: Tom Author-X-Name-Last: Heskes Author-Name: Guido Sanguinetti Author-X-Name-First: Guido Author-X-Name-Last: Sanguinetti Title: Sparse Approximate Inference for Spatio-Temporal Point Process Models Abstract: Spatio-temporal log-Gaussian Cox process models play a central role in the analysis of spatially distributed systems in several disciplines. Yet, scalable inference remains computationally challenging, due both to the high-resolution modeling generally required and to the analytically intractable likelihood function. Here, we exploit the sparsity structure typical of (spatially) discretized log-Gaussian Cox process models by using approximate message-passing algorithms. The proposed algorithms scale well with the state dimension and the length of the temporal horizon with moderate loss in distributional accuracy. They hence provide a flexible and faster alternative to both nonlinear filtering-smoothing type algorithms and to approaches that implement the Laplace method or expectation propagation on (block) sparse latent Gaussian models. We infer the parameters of the latent Gaussian model using a structured variational Bayes approach. We demonstrate the proposed framework on simulation studies with both Gaussian and point-process observations and use it to reconstruct the conflict intensity and dynamics in Afghanistan from the WikiLeaks Afghan War Diary. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1746-1763 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1115357 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1115357 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1746-1763 Template-Type: ReDIF-Article 1.0 Author-Name: Claudio Agostinelli Author-X-Name-First: Claudio Author-X-Name-Last: Agostinelli Author-Name: Víctor J. Yohai Author-X-Name-First: Víctor J. Author-X-Name-Last: Yohai Title: Composite Robust Estimators for Linear Mixed Models Abstract: The classical Tukey–Huber contamination model (CCM) is a commonly adopted framework to describe the mechanism of outlier generation in robust statistics. Given a dataset with n observations and p variables, under the CCM an entire unit is treated as an outlier, even if only one or a few of its values are corrupted. Classical robust procedures were designed to cope with this type of outlier. Recently, a new mechanism of outlier generation was introduced, namely, the independent contamination model (ICM), where the events that individual cells of the data matrix are outliers are independent and have the same probability. The ICM poses new challenges to robust statistics since the percentage of contaminated rows dramatically increases with p, often exceeding 50%, whereas classical affine equivariant robust procedures have a breakdown point of at most 50%. For the ICM, we propose a new type of robust method, namely composite robust procedures, inspired by the idea of composite likelihood, in which low-dimensional likelihoods, very often likelihoods of pairs, are aggregated to obtain a tractable approximation of the full likelihood.
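The composite likelihood device just described admits a generic sketch: for a zero-mean, unit-variance equicorrelation model, every pair of coordinates contributes its bivariate normal log-density, and the pair contributions are summed and maximized. This illustrates the general idea only, not the robust τ-version developed in the article; the model and grid below are assumptions.

import numpy as np
from itertools import combinations
from scipy.stats import multivariate_normal

def pairwise_loglik(data, rho):
    # Sum bivariate normal log-densities over all coordinate pairs.
    cov = np.array([[1.0, rho], [rho, 1.0]])
    mvn = multivariate_normal(mean=np.zeros(2), cov=cov)
    return sum(mvn.logpdf(data[:, [j, k]]).sum()
               for j, k in combinations(range(data.shape[1]), 2))

rng = np.random.default_rng(2)
p, rho_true = 5, 0.5
sigma = np.full((p, p), rho_true) + (1 - rho_true) * np.eye(p)
data = rng.multivariate_normal(np.zeros(p), sigma, size=400)

grid = np.linspace(0.05, 0.9, 18)
print(grid[np.argmax([pairwise_loglik(data, r) for r in grid])])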
Our composite robust procedures are built on pairs of observations to gain robustness in the ICM. We propose composite τ-estimators for linear mixed models. Composite τ-estimators are proved to have a high breakdown point under both the CCM and the ICM. A Monte Carlo study shows that while classical S-estimators can only cope with outliers generated by the CCM, the estimators proposed here are resistant to both CCM and ICM outliers. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1764-1774 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1115358 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1115358 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1764-1774 Template-Type: ReDIF-Article 1.0 Author-Name: Xinyu Zhang Author-X-Name-First: Xinyu Author-X-Name-Last: Zhang Author-Name: Dalei Yu Author-X-Name-First: Dalei Author-X-Name-Last: Yu Author-Name: Guohua Zou Author-X-Name-First: Guohua Author-X-Name-Last: Zou Author-Name: Hua Liang Author-X-Name-First: Hua Author-X-Name-Last: Liang Title: Optimal Model Averaging Estimation for Generalized Linear Models and Generalized Linear Mixed-Effects Models Abstract: Considering model averaging estimation in generalized linear models, we propose a weight choice criterion based on the Kullback–Leibler (KL) loss with a penalty term. This criterion differs in principle from that for continuous observations, but reduces to the Mallows criterion in that situation. We prove that the corresponding model averaging estimator is asymptotically optimal under certain assumptions. We further extend our concern to the generalized linear mixed-effects model framework and establish associated theory. Numerical experiments illustrate that the proposed method is promising. Journal: Journal of the American Statistical Association Pages: 1775-1790 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1115762 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1115762 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1775-1790 Template-Type: ReDIF-Article 1.0 Author-Name: Abhra Sarkar Author-X-Name-First: Abhra Author-X-Name-Last: Sarkar Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: Bayesian Nonparametric Modeling of Higher Order Markov Chains Abstract: We consider the problem of flexible modeling of higher order Markov chains when an upper bound on the order of the chain is known but the true order and nature of the serial dependence are unknown. We propose Bayesian nonparametric methodology based on conditional tensor factorizations, which can characterize any transition probability with a specified maximal order. The methodology selects the important lags and captures higher order interactions among the lags, while also facilitating calculation of Bayes factors for a variety of hypotheses of interest. We design efficient Markov chain Monte Carlo algorithms for posterior computation, allowing for uncertainty in the set of important lags to be included and in the nature and order of the serial dependence. The methods are illustrated using simulation experiments and real world applications. Supplementary materials for this article are available online.
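For orientation, the object modeled in the abstract above, the transition distribution of a higher-order Markov chain, can be estimated by raw empirical frequencies as sketched below for an assumed order q = 2; the article's conditional tensor factorization replaces such counts with a prior that selects important lags and borrows strength across contexts.

from collections import Counter, defaultdict

def empirical_transitions(x, order=2):
    # Count each (lag-q context, next state) pair, ...
    counts = defaultdict(Counter)
    for t in range(order, len(x)):
        counts[tuple(x[t - order:t])][x[t]] += 1
    # ... then normalize each context's counts into a distribution.
    return {c: {s: n / sum(cnt.values()) for s, n in cnt.items()}
            for c, cnt in counts.items()}

x = [0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
print(empirical_transitions(x, order=2))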
Journal: Journal of the American Statistical Association Pages: 1791-1803 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1115763 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1115763 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1791-1803 Template-Type: ReDIF-Article 1.0 Author-Name: Degui Li Author-X-Name-First: Degui Author-X-Name-Last: Li Author-Name: Junhui Qian Author-X-Name-First: Junhui Author-X-Name-Last: Qian Author-Name: Liangjun Su Author-X-Name-First: Liangjun Author-X-Name-Last: Su Title: Panel Data Models With Interactive Fixed Effects and Multiple Structural Breaks Abstract: In this article, we consider estimation of common structural breaks in panel data models with unobservable interactive fixed effects. We introduce a penalized principal component (PPC) estimation procedure with an adaptive group fused LASSO to detect the multiple structural breaks in the models. Under some mild conditions, we show that with probability approaching one the proposed method can correctly determine the unknown number of breaks and consistently estimate the common break dates. Furthermore, we estimate the regression coefficients through the post-LASSO method and establish the asymptotic distribution theory for the resulting estimators. The developed methodology and theory are applicable to the case of dynamic panel data models. Simulation results demonstrate that the proposed method works well in finite samples, with a low false detection probability when there is no structural break and a high probability of correctly estimating the number of breaks when structural breaks exist. We finally apply our method to study the environmental Kuznets curve for 74 countries over 40 years and detect two breaks in the data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1804-1819 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1119696 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1119696 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1804-1819 Template-Type: ReDIF-Article 1.0 Author-Name: Colin B. Fogarty Author-X-Name-First: Colin B. Author-X-Name-Last: Fogarty Author-Name: Dylan S. Small Author-X-Name-First: Dylan S. Author-X-Name-Last: Small Title: Sensitivity Analysis for Multiple Comparisons in Matched Observational Studies Through Quadratically Constrained Linear Programming Abstract: A sensitivity analysis in an observational study assesses the robustness of significant findings to unmeasured confounding. While sensitivity analyses in matched observational studies have been well addressed when there is a single outcome variable, accounting for multiple comparisons through the existing methods yields overly conservative results when there are multiple outcome variables of interest. This stems from the fact that unmeasured confounding cannot affect the probability of assignment to treatment differently depending on the outcome being analyzed. Existing methods allow this to occur by combining the results of individual sensitivity analyses to assess whether at least one hypothesis is significant, which in turn results in an overly pessimistic assessment of a study's sensitivity to unobserved biases.
By solving a quadratically constrained linear program, we are able to perform a sensitivity analysis while enforcing that unmeasured confounding must have the same impact on the treatment assignment probabilities across outcomes for each individual in the study. We show that this allows for uniform improvements in the power of a sensitivity analysis not only for testing the overall null of no effect, but also for null hypotheses on specific outcome variables while strongly controlling the familywise error rate. We illustrate our method through an observational study on the effect of smoking on naphthalene exposure. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1820-1830 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1120675 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1120675 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1820-1830 Template-Type: ReDIF-Article 1.0 Author-Name: J. R. Lockwood Author-X-Name-First: J. R. Author-X-Name-Last: Lockwood Author-Name: Daniel F. McCaffrey Author-X-Name-First: Daniel F. Author-X-Name-Last: McCaffrey Title: Matching and Weighting With Functions of Error-Prone Covariates for Causal Inference Abstract: Matching estimators are commonly used to estimate causal effects in nonexperimental settings. Covariate measurement error can be problematic for matching estimators when observational treatment groups differ on latent quantities observed only through error-prone surrogates. We establish necessary and sufficient conditions for matching and weighting with functions of observed covariates to yield unconfounded causal effect estimators, generalizing results from the standard (i.e., no measurement error) case. We establish that in common covariate measurement error settings, including continuous variables with continuous measurement error, discrete variables with misclassification, and factor and item response theory models, no single function of the observed covariates computed for all units in a study is appropriate for matching. However, we demonstrate that in some circumstances, it is possible to create different functions of the observed covariates for treatment and control units to construct a variable appropriate for matching. We also demonstrate the counterintuitive result that in some settings, it is possible to selectively contaminate the covariates with additional measurement error to construct a variable appropriate for matching. We discuss the implications of our results for the choice between matching and weighting estimators with error-prone covariates. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1831-1839 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2015.1122601 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1122601 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1831-1839 Template-Type: ReDIF-Article 1.0 Author-Name: Guanhua Chen Author-X-Name-First: Guanhua Author-X-Name-Last: Chen Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: Michael R. Kosorok Author-X-Name-First: Michael R. 
Kosorok Author-X-Name-First: Michael R. Author-X-Name-Last: Kosorok Title: Personalized Dose Finding Using Outcome Weighted Learning Abstract: In dose-finding clinical trials, it is becoming increasingly important to account for individual-level heterogeneity while searching for optimal doses to ensure that an optimal individualized dose rule (IDR) maximizes the expected beneficial clinical outcome for each individual. In this article, we advocate a randomized trial design where candidate dose levels assigned to study subjects are randomly chosen from a continuous distribution within a safe range. To estimate the optimal IDR using such data, we propose an outcome weighted learning method based on a nonconvex loss function, which can be solved efficiently using a difference of convex functions algorithm. The consistency and convergence rate for the estimated IDR are derived, and its small-sample performance is evaluated via simulation studies. We demonstrate that the proposed method outperforms competing approaches. Finally, we illustrate this method using data from a cohort study for warfarin (an anti-thrombotic drug) dosing. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1509-1521 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1148611 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1148611 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1509-1521 Template-Type: ReDIF-Article 1.0 Author-Name: Martin Lysy Author-X-Name-First: Martin Author-X-Name-Last: Lysy Author-Name: Natesh S. Pillai Author-X-Name-First: Natesh S. Author-X-Name-Last: Pillai Author-Name: David B. Hill Author-X-Name-First: David B. Author-X-Name-Last: Hill Author-Name: M. Gregory Forest Author-X-Name-First: M. Gregory Author-X-Name-Last: Forest Author-Name: John W. R. Mellnik Author-X-Name-First: John W. R. Author-X-Name-Last: Mellnik Author-Name: Paula A. Vasquez Author-X-Name-First: Paula A. Author-X-Name-Last: Vasquez Author-Name: Scott A. McKinley Author-X-Name-First: Scott A. Author-X-Name-Last: McKinley Title: Model Comparison and Assessment for Single Particle Tracking in Biological Fluids Abstract: State-of-the-art techniques in passive particle-tracking microscopy provide high-resolution path trajectories of diverse foreign particles in biological fluids. For particles on the order of 1 μm in diameter, these paths are generally inconsistent with simple Brownian motion. Yet, despite an abundance of data confirming these findings and their wide-ranging scientific implications, stochastic modeling of the complex particle motion has received comparatively little attention. Even among posited models, there is virtually no literature on likelihood-based inference, model comparisons, and other quantitative assessments. In this article, we develop a rigorous and computationally efficient Bayesian methodology to address this gap. We analyze two of the most prevalent candidate models for 30-sec paths of 1 μm diameter tracer particles in human lung mucus: fractional Brownian motion (fBM) and a Generalized Langevin Equation (GLE) consistent with viscoelastic theory. Our model comparisons distinctly favor GLE over fBM, with the former describing the data remarkably well up to the timescales for which we have reliable information. Supplementary materials for this article are available online.
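One of the two candidate models compared above, fractional Brownian motion, can be simulated exactly via the generic Cholesky construction sketched below; the Hurst parameter value is an illustrative assumption.

import numpy as np

def fbm_path(n, hurst, dt=1.0, seed=0):
    # Autocovariance of fractional Gaussian noise at lag k:
    #   0.5 * (|k+1|^(2H) - 2|k|^(2H) + |k-1|^(2H))
    k = np.arange(n)
    gamma = 0.5 * (np.abs(k + 1) ** (2 * hurst)
                   - 2.0 * np.abs(k) ** (2 * hurst)
                   + np.abs(k - 1) ** (2 * hurst))
    cov = gamma[np.abs(k[:, None] - k[None, :])] * dt ** (2 * hurst)
    # Draw the increments jointly from their exact normal law.
    increments = np.linalg.cholesky(cov) @ np.random.default_rng(seed).normal(size=n)
    return np.concatenate([[0.0], np.cumsum(increments)])

# hurst < 0.5 gives anti-persistent, subdiffusive motion of the kind
# reported for micron-scale particles in mucus.
path = fbm_path(300, hurst=0.3)
print(path[:5])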
Journal: Journal of the American Statistical Association Pages: 1413-1426 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1158716 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1158716 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1413-1426 Template-Type: ReDIF-Article 1.0 Author-Name: Yize Zhao Author-X-Name-First: Yize Author-X-Name-Last: Zhao Author-Name: Matthias Chung Author-X-Name-First: Matthias Author-X-Name-Last: Chung Author-Name: Brent A. Johnson Author-X-Name-First: Brent A. Author-X-Name-Last: Johnson Author-Name: Carlos S. Moreno Author-X-Name-First: Carlos S. Author-X-Name-Last: Moreno Author-Name: Qi Long Author-X-Name-First: Qi Author-X-Name-Last: Long Title: Hierarchical Feature Selection Incorporating Known and Novel Biological Information: Identifying Genomic Features Related to Prostate Cancer Recurrence Abstract: Our work is motivated by a prostate cancer study aimed at identifying mRNA and miRNA biomarkers that are predictive of cancer recurrence after prostatectomy. It has been shown in the literature that incorporating known biological information on pathway memberships and interactions among biomarkers improves feature selection of high-dimensional biomarkers in relation to disease risk. Biological information is often represented by graphs or networks, in which biomarkers are represented by nodes and interactions among them are represented by edges; however, biological information is often not fully known. For example, the role of microRNAs (miRNAs) in regulating gene expression is not fully understood and the miRNA regulatory network is not fully established, in which case new strategies are needed for feature selection. To this end, we treat unknown biological information as missing data (i.e., missing edges in graphs), different from commonly encountered missing data problems where variable values are missing. We propose a new concept of imputing unknown biological information based on observed data and define the imputed information as the novel biological information. In addition, we propose a hierarchical group penalty to encourage sparsity and feature selection at both the pathway level and the within-pathway level, which, combined with the imputation step, allows for incorporation of known and novel biological information. While it is applicable to general regression settings, we develop and investigate the proposed approach in the context of semiparametric accelerated failure time models motivated by our data example. Data application and simulation studies show that incorporation of novel biological information improves performance in risk prediction and feature selection and the proposed penalty outperforms the extensions of several existing penalties. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1427-1439 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1164051 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1164051 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1427-1439 Template-Type: ReDIF-Article 1.0 Author-Name: Mark Fiecas Author-X-Name-First: Mark Author-X-Name-Last: Fiecas Author-Name: Hernando Ombao Author-X-Name-First: Hernando Author-X-Name-Last: Ombao Title: Modeling the Evolution of Dynamic Brain Processes During an Associative Learning Experiment Abstract: We develop a new time series model to investigate the dynamic interactions between the nucleus accumbens and the hippocampus during an associative learning experiment. Preliminary analyses indicated that the spectral properties of the local field potentials at these two regions changed over the trials of the experiment. While many models already take into account nonstationarity within a single trial, the evolution of the dynamics across trials is often ignored. Our proposed model, the slowly evolving locally stationary process (SEv-LSP), is designed to capture nonstationarity both within a trial and across trials. We rigorously define the evolving evolutionary spectral density matrix, which we estimate using a two-stage procedure. In the first stage, we compute the within-trial time-localized periodogram matrix. In the second stage, we develop a data-driven approach that combines information from trial-specific local periodogram matrices. Through simulation studies, we show the utility of our proposed method for analyzing time series data with different evolutionary structures. Finally, we use the SEv-LSP model to demonstrate the evolving dynamics between the hippocampus and the nucleus accumbens during an associative learning experiment. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1440-1453 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1165683 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1165683 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1440-1453 Template-Type: ReDIF-Article 1.0 Author-Name: J. T. Gaskins Author-X-Name-First: J. T. Author-X-Name-Last: Gaskins Author-Name: M. J. Daniels Author-X-Name-First: M. J. Author-X-Name-Last: Daniels Author-Name: B. H. Marcus Author-X-Name-First: B. H. Author-X-Name-Last: Marcus Title: Bayesian Methods for Nonignorable Dropout in Joint Models in Smoking Cessation Studies Abstract: Inference on data with missingness can be challenging, particularly if the knowledge that a measurement was unobserved provides information about its distribution. Our work is motivated by the Commit to Quit II study, a smoking cessation trial that measured smoking status and weight change as weekly outcomes. It is expected that dropout in this study was informative and that patients with missed measurements are more likely to be smoking, even after conditioning on their observed smoking and weight history. We jointly model the categorical smoking status and continuous weight change outcomes by assuming normal latent variables for cessation and by extending the usual pattern mixture model to the bivariate case. The model includes a novel approach to sharing information across patterns through a Bayesian shrinkage framework to improve estimation stability for sparsely observed patterns. 
To accommodate the presumed informativeness of the missing data in a parsimonious manner, we model the unidentified components of the model under a nonfuture dependence assumption and specify departures from missing at random through sensitivity parameters, whose distributions are elicited from a subject-matter expert. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1454-1465 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1167693 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1167693 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1454-1465 Template-Type: ReDIF-Article 1.0 Author-Name: Jared S. Murray Author-X-Name-First: Jared S. Author-X-Name-Last: Murray Author-Name: Jerome P. Reiter Author-X-Name-First: Jerome P. Author-X-Name-Last: Reiter Title: Multiple Imputation of Missing Categorical and Continuous Values via Bayesian Mixture Models With Local Dependence Abstract: We present a nonparametric Bayesian joint model for multivariate continuous and categorical variables, with the intention of developing a flexible engine for multiple imputation of missing values. The model fuses Dirichlet process mixtures of multinomial distributions for categorical variables with Dirichlet process mixtures of multivariate normal distributions for continuous variables. We incorporate dependence between the continuous and categorical variables by (1) modeling the means of the normal distributions as component-specific functions of the categorical variables and (2) forming distinct mixture components for the categorical and continuous data with probabilities that are linked via a hierarchical model. This structure allows the model to capture complex dependencies between the categorical and continuous data with minimal tuning by the analyst. We apply the model to impute missing values due to item nonresponse in an evaluation of the redesign of the Survey of Income and Program Participation (SIPP). The goal is to compare estimates from a field test with the new design to estimates from selected individuals from a panel collected under the old design. We show that accounting for the missing data changes some conclusions about the comparability of the distributions in the two datasets. We also perform an extensive repeated sampling simulation using similar data from complete cases in an existing SIPP panel, comparing our proposed model to a default application of multiple imputation by chained equations. Imputations based on the proposed model tend to have better repeated sampling properties than the default application of chained equations in this realistic setting. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1466-1479 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1174132 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1174132 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1466-1479 Template-Type: ReDIF-Article 1.0 Author-Name: Wolfgang Karl Härdle Author-X-Name-First: Wolfgang Karl Author-X-Name-Last: Härdle Author-Name: Brenda López Cabrera Author-X-Name-First: Brenda Author-X-Name-Last: López Cabrera Author-Name: Ostap Okhrin Author-X-Name-First: Ostap Author-X-Name-Last: Okhrin Author-Name: Weining Wang Author-X-Name-First: Weining Author-X-Name-Last: Wang Title: Localizing Temperature Risk Abstract: On the temperature derivative market, modeling temperature volatility is an important issue for pricing and hedging. To apply the pricing tools of financial mathematics, one needs to isolate a Gaussian risk factor. A conventional model for temperature dynamics is a stochastic model with seasonality and intertemporal autocorrelation. Empirical work based on seasonality and autocorrelation correction reveals that the obtained residuals are heteroscedastic with a periodic pattern. The object of this research is to estimate this heteroscedastic function so that, after scale normalization, a pure standardized Gaussian variable appears. Earlier works investigated temperature risk in different locations and showed that neither parametric component functions nor a local linear smoother with constant smoothing parameter are flexible enough to generally describe the variance process well. Therefore, we consider a local adaptive modeling approach to find, at each time point, an optimal smoothing parameter to locally estimate the seasonality and volatility. Our approach provides a more flexible and accurate fitting procedure for localized temperature risk by achieving nearly normal risk factors. We also employ our model to forecast the temperature in different cities and compare it to a model developed in 2005 by Campbell and Diebold. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1491-1508 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1180985 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1180985 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1491-1508 Template-Type: ReDIF-Article 1.0 Author-Name: Simon N. Wood Author-X-Name-First: Simon N. Author-X-Name-Last: Wood Author-Name: Natalya Pya Author-X-Name-First: Natalya Author-X-Name-Last: Pya Author-Name: Benjamin Säfken Author-X-Name-First: Benjamin Author-X-Name-Last: Säfken Title: Smoothing Parameter and Model Selection for General Smooth Models Abstract: This article discusses a general framework for smoothing parameter estimation for models with regular likelihoods constructed in terms of unknown smooth functions of covariates. Gaussian random effects and parametric terms may also be present. By construction the method is numerically stable and convergent, and enables smoothing parameter uncertainty to be quantified. The latter enables us to fix a well known problem with AIC for such models, thereby improving the range of model selection tools available. The smooth functions are represented by reduced rank spline like smoothers, with associated quadratic penalties measuring function smoothness. Model estimation is by penalized likelihood maximization, where the smoothing parameters controlling the extent of penalization are estimated by Laplace approximate marginal likelihood.
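In its simplest Gaussian special case, the penalized fitting problem just described reduces to a closed-form ridge-type solve, as the sketch below shows; the truncated-power basis and the fixed smoothing parameter are illustrative assumptions, whereas the framework estimates the smoothing parameters by Laplace approximate marginal likelihood.

import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Truncated power basis: intercept, linear term, and one hinge per knot.
knots = np.linspace(0, 1, 12)[1:-1]
B = np.column_stack([np.ones_like(x), x] + [np.maximum(x - k, 0) for k in knots])
S = np.diag([0.0, 0.0] + [1.0] * len(knots))   # penalize only the hinge terms

# Minimize ||y - B beta||^2 + lam * beta' S beta.
lam = 1.0
beta = np.linalg.solve(B.T @ B + lam * S, B.T @ y)
fitted = B @ beta
print(np.round(beta[:4], 3))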
The methods cover, for example, generalized additive models for nonexponential family responses (e.g., beta, ordered categorical, scaled t distribution, negative binomial and Tweedie distributions), generalized additive models for location scale and shape (e.g., two stage zero inflation models, and Gaussian location-scale models), Cox proportional hazards models and multivariate additive models. The framework reduces the implementation of new model classes to the coding of some standard derivatives of the log-likelihood. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1548-1563 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1180986 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1180986 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1548-1563 Template-Type: ReDIF-Article 1.0 Author-Name: Jörg Polzehl Author-X-Name-First: Jörg Author-X-Name-Last: Polzehl Author-Name: Karsten Tabelow Author-X-Name-First: Karsten Author-X-Name-Last: Tabelow Title: Low SNR in Diffusion MRI Models Abstract: Noise is a common issue for all magnetic resonance imaging (MRI) techniques such as diffusion MRI and obviously leads to variability of the estimates in any model describing the data. Increasing spatial resolution in MR experiments further diminishes the signal-to-noise ratio (SNR). However, with low SNR the expected signal deviates from the true value. Common modeling approaches therefore lead to a bias in estimated model parameters. Adjustments require an analysis of the data generating process and a characterization of the resulting distribution of the imaging data. We provide an adequate quasi-likelihood approach that employs these characteristics. We elaborate on the effects of typical data preprocessing and analyze the bias effects related to low SNR for the example of the diffusion tensor model in diffusion MRI. We then demonstrate the relevance of the problem using data from the Human Connectome Project. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1480-1490 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1222284 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1222284 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1480-1490 Template-Type: ReDIF-Article 1.0 Author-Name: Michael P. Wallace Author-X-Name-First: Michael P. Author-X-Name-Last: Wallace Author-Name: Erica E. M. Moodie Author-X-Name-First: Erica E. M. Author-X-Name-Last: Moodie Author-Name: David A. Stephens Author-X-Name-First: David A. Author-X-Name-Last: Stephens Title: Comment Journal: Journal of the American Statistical Association Pages: 1530-1534 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1240080 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1240080 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1530-1534 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Correction Journal: Journal of the American Statistical Association Pages: 1852-1852 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1240685 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1240685 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1852-1852 Template-Type: ReDIF-Article 1.0 Author-Name: Alexander R. Luedtke Author-X-Name-First: Alexander R. Author-X-Name-Last: Luedtke Author-Name: Mark J. van der Laan Author-X-Name-First: Mark J. van der Author-X-Name-Last: Laan Title: Comment Journal: Journal of the American Statistical Association Pages: 1526-1530 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1242427 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1242427 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1526-1530 Template-Type: ReDIF-Article 1.0 Author-Name: Min Qian Author-X-Name-First: Min Author-X-Name-Last: Qian Title: Comment Abstract: This commentary deals with issues related to the article by Chen, Zeng, and Kosorok. We present several potential modifications of the outcome weighted learning approach. These modifications are based on a truncated ℓ2 loss. One advantage of the ℓ2 loss is that it is differentiable everywhere, which makes it more stable and computationally more tractable. Journal: Journal of the American Statistical Association Pages: 1538-1541 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1243479 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1243479 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1538-1541 Template-Type: ReDIF-Article 1.0 Author-Name: Elizabeth L. Ogburn Author-X-Name-First: Elizabeth L. Author-X-Name-Last: Ogburn Title: Comment Journal: Journal of the American Statistical Association Pages: 1534-1537 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1243480 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1243480 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1534-1537 Template-Type: ReDIF-Article 1.0 Author-Name: Michael Rosenblum Author-X-Name-First: Michael Author-X-Name-Last: Rosenblum Title: Comment Journal: Journal of the American Statistical Association Pages: 1541-1542 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1243481 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1243481 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1541-1542 Template-Type: ReDIF-Article 1.0 Author-Name: Tianxi Cai Author-X-Name-First: Tianxi Author-X-Name-Last: Cai Author-Name: Lu Tian Author-X-Name-First: Lu Author-X-Name-Last: Tian Title: Comment Journal: Journal of the American Statistical Association Pages: 1521-1524 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1244064 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1244064 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1521-1524 Template-Type: ReDIF-Article 1.0 Author-Name: Jun Fan Author-X-Name-First: Jun Author-X-Name-Last: Fan Author-Name: Ming Yuan Author-X-Name-First: Ming Author-X-Name-Last: Yuan Title: Comment Journal: Journal of the American Statistical Association Pages: 1524-1525 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1244065 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1244065 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1524-1525 Template-Type: ReDIF-Article 1.0 Author-Name: Edoardo M. Airoldi Author-X-Name-First: Edoardo M. Author-X-Name-Last: Airoldi Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 1410-1412 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1245070 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1245070 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1410-1412 Template-Type: ReDIF-Article 1.0 Author-Name: David M. Blei Author-X-Name-First: David M. Author-X-Name-Last: Blei Title: Comment Journal: Journal of the American Statistical Association Pages: 1408-1410 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1245071 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1245071 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1408-1410 Template-Type: ReDIF-Article 1.0 Author-Name: Aleksandrina Goeva Author-X-Name-First: Aleksandrina Author-X-Name-Last: Goeva Author-Name: Eric D. Kolaczyk Author-X-Name-First: Eric D. Author-X-Name-Last: Kolaczyk Title: Comment Journal: Journal of the American Statistical Association Pages: 1405-1408 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1245072 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1245072 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1405-1408 Template-Type: ReDIF-Article 1.0 Author-Name: Matt Taddy Author-X-Name-First: Matt Author-X-Name-Last: Taddy Title: Comment Journal: Journal of the American Statistical Association Pages: 1403-1405 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1245073 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1245073 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1403-1405 Template-Type: ReDIF-Article 1.0 Author-Name: Guanhua Chen Author-X-Name-First: Guanhua Author-X-Name-Last: Chen Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: Michael R. Kosorok Author-X-Name-First: Michael R. Author-X-Name-Last: Kosorok Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 1543-1547 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1250573 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1250573 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1543-1547 Template-Type: ReDIF-Article 1.0 Author-Name: Thomas Kneib Author-X-Name-First: Thomas Author-X-Name-Last: Kneib Title: Comment Journal: Journal of the American Statistical Association Pages: 1563-1565 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1250576 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1250576 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1563-1565 Template-Type: ReDIF-Article 1.0 Author-Name: Thomas W. Yee Author-X-Name-First: Thomas W. Author-X-Name-Last: Yee Title: Comment Journal: Journal of the American Statistical Association Pages: 1565-1568 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1250579 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1250579 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1565-1568 Template-Type: ReDIF-Article 1.0 Author-Name: Sonja Greven Author-X-Name-First: Sonja Author-X-Name-Last: Greven Author-Name: Fabian Scheipl Author-X-Name-First: Fabian Author-X-Name-Last: Scheipl Title: Comment Journal: Journal of the American Statistical Association Pages: 1568-1573 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1250580 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1250580 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1568-1573 Template-Type: ReDIF-Article 1.0 Author-Name: Simon N. Wood Author-X-Name-First: Simon N. Author-X-Name-Last: Wood Author-Name: Natalya Pya Author-X-Name-First: Natalya Author-X-Name-Last: Pya Author-Name: Benjamin Säfken Author-X-Name-First: Benjamin Author-X-Name-Last: Säfken Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 1573-1575 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1250583 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1250583 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1573-1575 Template-Type: ReDIF-Article 1.0 Author-Name: Jessica Utts Author-X-Name-First: Jessica Author-X-Name-Last: Utts Title: Appreciating Statistics Journal: Journal of the American Statistical Association Pages: 1373-1380 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1250592 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1250592 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1373-1380 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Editorial Collaborators Journal: Journal of the American Statistical Association Pages: 1853-1861 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1255066 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1255066 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1853-1861 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Book Reviews Journal: Journal of the American Statistical Association Pages: 1840-1851 Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1257826 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1257826 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1840-1851 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Editorial Board EOV Journal: Journal of the American Statistical Association Pages: ebi-ebi Issue: 516 Volume: 111 Year: 2016 Month: 10 X-DOI: 10.1080/01621459.2016.1267991 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1267991 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:ebi-ebi Template-Type: ReDIF-Article 1.0 Author-Name: Daniel Peña Author-X-Name-First: Daniel Author-X-Name-Last: Peña Author-Name: Victor J. Yohai Author-X-Name-First: Victor J. Author-X-Name-Last: Yohai Title: Generalized Dynamic Principal Components Abstract: Brillinger defined dynamic principal components (DPC) for time series based on a reconstruction criterion. He gave a very elegant theoretical solution and proposed an estimator which is consistent under stationarity. Here, we propose a new, entirely empirical approach to DPC. The main differences from the existing methods—mainly Brillinger's procedure—are (1) the DPC we propose need not be a linear combination of the observations and (2) it can be based on a variety of loss functions including robust ones. Unlike Brillinger, we do not establish any consistency results; however, contrary to Brillinger's approach, which has a very strong stationarity flavor, our concept aims at a better adaptation to possible nonstationary features of the series. We also present a robust version of our procedure that allows the DPC to be estimated when the series is contaminated by outliers. We give iterative algorithms to compute the proposed procedures that can be used with a large number of variables. Our nonrobust and robust procedures are illustrated with real datasets. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1121-1131 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2015.1072542 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1072542 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1121-1131 Template-Type: ReDIF-Article 1.0 Author-Name: Ritabrata Das Author-X-Name-First: Ritabrata Author-X-Name-Last: Das Author-Name: Moulinath Banerjee Author-X-Name-First: Moulinath Author-X-Name-Last: Banerjee Author-Name: Bin Nan Author-X-Name-First: Bin Author-X-Name-Last: Nan Author-Name: Huiyong Zheng Author-X-Name-First: Huiyong Author-X-Name-Last: Zheng Title: Fast Estimation of Regression Parameters in a Broken-Stick Model for Longitudinal Data Abstract: Estimation of change-point locations in the broken-stick model has significant applications in modeling important biological phenomena. In this article, we present a computationally economical likelihood-based approach for estimating change-point(s) efficiently in both cross-sectional and longitudinal settings.
Our method, based on local smoothing in a shrinking neighborhood of each change-point, is shown via simulations to be computationally more viable than existing methods that rely on search procedures, with dramatic gains in the multiple change-point case. The proposed estimates are shown to have $\sqrt{n}$-consistency and asymptotic normality—in particular, they are asymptotically efficient in the cross-sectional setting—allowing us to provide meaningful statistical inference. As our primary and motivating (longitudinal) application, we study the Michigan Bone Health and Metabolism Study cohort data to describe patterns of change in log estradiol levels, before and after the final menstrual period, for which a two change-point broken-stick model appears to be a good fit. We also illustrate our method on a plant growth dataset in the cross-sectional setting. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1132-1143 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2015.1073154 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1073154 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1132-1143 Template-Type: ReDIF-Article 1.0 Author-Name: Mingyuan Zhou Author-X-Name-First: Mingyuan Author-X-Name-Last: Zhou Author-Name: Oscar Hernan Madrid Padilla Author-X-Name-First: Oscar Hernan Madrid Author-X-Name-Last: Padilla Author-Name: James G. Scott Author-X-Name-First: James G. Author-X-Name-Last: Scott Title: Priors for Random Count Matrices Derived from a Family of Negative Binomial Processes Abstract: We define a family of probability distributions for random count matrices with a potentially unbounded number of rows and columns. The three distributions we consider are derived from the gamma-Poisson, gamma-negative binomial, and beta-negative binomial processes, which we refer to generically as a family of negative-binomial processes. Because the models lead to closed-form update equations within the context of a Gibbs sampler, they are natural candidates for nonparametric Bayesian priors over count matrices. A key aspect of our analysis is the recognition that although the random count matrices within the family are defined by a row-wise construction, their columns can be shown to be independent and identically distributed (iid). This fact is used to derive explicit formulas for drawing all the columns at once. Moreover, by analyzing these matrices’ combinatorial structure, we describe how to sequentially construct a column-iid random count matrix one row at a time, and derive the predictive distribution of a new row count vector with previously unseen features. We describe the similarities and differences between the three priors, and argue that the greater flexibility of the gamma- and beta-negative binomial processes—especially their ability to model over-dispersed, heavy-tailed count data—makes these well suited to a wide variety of real-world applications. As an example of our framework, we construct a naive-Bayes text classifier to categorize a count vector to one of several existing random count matrices of different categories. The classifier supports an unbounded number of features and, unlike most existing methods, it does not require a predefined finite vocabulary to be shared by all the categories, and needs neither feature selection nor parameter tuning.
Both the gamma- and beta-negative binomial processes are shown to significantly outperform the gamma-Poisson process when applied to document categorization, with comparable performance to other state-of-the-art supervised text classification algorithms. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1144-1156 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2015.1075407 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1075407 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1144-1156 Template-Type: ReDIF-Article 1.0 Author-Name: Samuel D. Pimentel Author-X-Name-First: Samuel D. Author-X-Name-Last: Pimentel Author-Name: Dylan S. Small Author-X-Name-First: Dylan S. Author-X-Name-Last: Small Author-Name: Paul R. Rosenbaum Author-X-Name-First: Paul R. Author-X-Name-Last: Rosenbaum Title: Constructed Second Control Groups and Attenuation of Unmeasured Biases Abstract: The informal folklore of observational studies claims that if an irrelevant observed covariate is left uncontrolled, say unmatched, then it will influence treatment assignment in haphazard ways, thereby diminishing the biases from unmeasured covariates. We prove a result along these lines: it is true, in a certain sense, to a limited degree, under certain conditions. Alas, the conditions are neither inconsequential nor easy to check in empirical work; indeed, they are often dubious, more often implausible. We suggest the result is most useful in the computerized construction of a second control group, where the investigator can see more in available data without necessarily believing the required conditions. One of the two control groups controls for the possibly irrelevant observed covariate, while the other control group either leaves it uncontrolled or forces separation; therefore, the investigator views one situation from two angles under different assumptions. A pair of sensitivity analyses for the two control groups is coordinated by a weighted Holm or recycling procedure built around the possibility of slight attenuation of bias in one control group. Issues are illustrated using an observational study of the possible effects of cigarette smoking as a cause of increased homocysteine levels, a risk factor for cardiovascular disease. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1157-1167 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2015.1076342 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1076342 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1157-1167 Template-Type: ReDIF-Article 1.0 Author-Name: Fernando A. Quintana Author-X-Name-First: Fernando A. Author-X-Name-Last: Quintana Author-Name: Wesley O. Johnson Author-X-Name-First: Wesley O. Author-X-Name-Last: Johnson Author-Name: L. Elaine Waetjen Author-X-Name-First: L. Elaine Author-X-Name-Last: Waetjen Author-Name: Ellen B. Gold Author-X-Name-First: Ellen B. Author-X-Name-Last: Gold Title: Bayesian Nonparametric Longitudinal Data Analysis Abstract: Practical Bayesian nonparametric methods have been developed across a wide variety of contexts.
Here, we develop a novel statistical model that generalizes standard mixed models for longitudinal data that include flexible mean functions as well as combined compound symmetry (CS) and autoregressive (AR) covariance structures. AR structure is often specified through the use of a Gaussian process (GP) with covariance functions that allow longitudinal data to be more correlated if they are observed closer in time than if they are observed farther apart. We allow for AR structure by considering a broader class of models that incorporates a Dirichlet process mixture (DPM) over the covariance parameters of the GP. We are able to take advantage of modern Bayesian statistical methods in making full predictive inferences, as well as inferences about characteristics of longitudinal profiles and their differences across covariate combinations. We also take advantage of the generality of our model, which provides for estimation of a variety of covariance structures. We observe that models that fail to incorporate CS or AR structure can result in very poor estimation of a covariance or correlation matrix. In our illustration using hormone data observed on women through the menopausal transition, biology dictates the use of a generalized family of sigmoid functions as a model for time trends across subpopulation categories. Journal: Journal of the American Statistical Association Pages: 1168-1181 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2015.1076725 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1076725 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1168-1181 Template-Type: ReDIF-Article 1.0 Author-Name: Ning Zhang Author-X-Name-First: Ning Author-X-Name-Last: Zhang Author-Name: Daniel W. Apley Author-X-Name-First: Daniel W. Author-X-Name-Last: Apley Title: Brownian Integrated Covariance Functions for Gaussian Process Modeling: Sigmoidal Versus Localized Basis Functions Abstract: Gaussian process modeling, or kriging, is a popular method for modeling data from deterministic computer simulations, and the most common choices of covariance function are Gaussian, power exponential, and Matérn. A characteristic of these covariance functions is that the basis functions associated with their corresponding response predictors are localized, in the sense that they decay to zero as the input location moves away from the simulated input sites. As a result, the predictors tend to revert to the prior mean, which can result in a bumpy fitted response surface. In contrast, a fractional Brownian field model results in a predictor with basis functions that are nonlocalized and more sigmoidal in shape, although it suffers from drawbacks such as an inability to represent smooth response surfaces. We propose a class of Brownian integrated covariance functions obtained by incorporating an integrator (as in the white noise integral representation of a fractional Brownian field) into any stationary covariance function. Brownian integrated covariance models result in predictor basis functions that are nonlocalized and sigmoidal, but they are capable of modeling smooth response surfaces. We discuss fundamental differences between Brownian integrated and other covariance functions, and we illustrate by comparing Brownian integrated power exponential with regular power exponential kriging models in a number of examples. Supplementary materials for this article are available online.
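To make the localization point above concrete, here is a minimal, self-contained Python sketch (ours, assuming a one-dimensional input, a zero prior mean, and a Gaussian covariance; it is not the authors' implementation):

import numpy as np

def gaussian_cov(a, b, length_scale=0.15):
    # Stationary Gaussian (squared-exponential) covariance: exp(-(d / l)**2).
    d = a[:, None] - b[None, :]
    return np.exp(-(d / length_scale) ** 2)

x_design = np.linspace(0.05, 0.95, 8)          # simulated input sites
y = np.sin(2 * np.pi * x_design)               # deterministic simulator output
K = gaussian_cov(x_design, x_design) + 1e-8 * np.eye(8)  # jitter for stability

x_grid = np.linspace(-0.5, 1.5, 201)
k_grid = gaussian_cov(x_design, x_grid)        # 8 x 201 cross-covariances
pred = k_grid.T @ np.linalg.solve(K, y)        # zero-mean simple kriging predictor

# Near the design sites the predictor tracks sin(2*pi*x); outside [0, 1] every
# basis function has decayed, so the prediction reverts to the prior mean 0.
print(round(pred[75], 3))                      # x = 0.25, close to sin(pi/2) = 1
print(round(pred[0], 6), round(pred[-1], 6))   # x = -0.5 and 1.5, near 0

Under a Brownian integrated covariance, the corresponding basis functions would instead be nonlocalized and sigmoidal, which is the contrast the article develops.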
Journal: Journal of the American Statistical Association Pages: 1182-1195 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2015.1077711 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1077711 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1182-1195 Template-Type: ReDIF-Article 1.0 Author-Name: Ziqi Chen Author-X-Name-First: Ziqi Author-X-Name-Last: Chen Author-Name: Chenlei Leng Author-X-Name-First: Chenlei Author-X-Name-Last: Leng Title: Dynamic Covariance Models Abstract: An important problem in contemporary statistics is to understand the relationship among a large number of variables based on a dataset, usually with p, the number of variables, much larger than n, the sample size. Recent efforts have focused on modeling static covariance matrices where pairwise covariances are considered invariant. In many real systems, however, these pairwise relations often change. To characterize the changing correlations in a high-dimensional system, we study a class of dynamic covariance models (DCMs) assumed to be sparse, and investigate for the first time a unified theory for understanding their nonasymptotic error rates and model selection properties. In particular, in the challenging high-dimensional regime, we highlight a new uniform consistency theory in which the effective sample size can be seen as $n^{4/5}$ when the bandwidth parameter is chosen as $h \propto n^{-1/5}$ to account for the dynamics. We show that this result holds uniformly over a range of the variable used for modeling the dynamics. The convergence rate bears the mark of the familiar bias-variance trade-off in the kernel smoothing literature. We illustrate the results with simulations and the analysis of a neuroimaging dataset. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1196-1207 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2015.1077712 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1077712 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1196-1207 Template-Type: ReDIF-Article 1.0 Author-Name: Ming-Yen Cheng Author-X-Name-First: Ming-Yen Author-X-Name-Last: Cheng Author-Name: Toshio Honda Author-X-Name-First: Toshio Author-X-Name-Last: Honda Author-Name: Jin-Ting Zhang Author-X-Name-First: Jin-Ting Author-X-Name-Last: Zhang Title: Forward Variable Selection for Sparse Ultra-High Dimensional Varying Coefficient Models Abstract: Varying coefficient models have numerous applications in a wide scope of scientific areas. While enjoying nice interpretability, they also allow for flexibility in modeling dynamic impacts of the covariates. But, in the new era of big data, it is challenging to select the relevant variables when the dimensionality is very large. Recently, several works have focused on this important problem under sparsity assumptions; however, they are subject to some limitations. We introduce an appealing forward selection procedure. It selects important variables sequentially according to a reduction in sum of squares criterion and it employs a Bayesian information criterion (BIC)-based stopping rule (a toy sketch of such a select-and-stop loop is given below). Clearly, it is simple to implement and fast to compute, and possesses many other desirable properties from theoretical and numerical viewpoints.
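For intuition, a minimal Python sketch of the select-and-stop idea (ours; it uses ordinary linear regression rather than varying coefficient models, so the function name and details are illustrative assumptions, not the authors' implementation):

import numpy as np

def forward_select_bic(X, y):
    # Greedy forward selection for a plain linear model, stopped by BIC:
    # at each step, add the candidate giving the smallest BIC (equivalently,
    # the largest reduction in residual sum of squares for a fixed model size).
    n, p = X.shape
    selected, remaining = [], list(range(p))

    def bic(cols):
        if cols:
            beta, rss_arr, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = rss_arr[0] if rss_arr.size else np.sum((y - X[:, cols] @ beta) ** 2)
        else:
            rss = np.sum(y ** 2)
        return n * np.log(rss / n) + len(cols) * np.log(n)

    current = bic(selected)
    while remaining:
        best, j_best = min((bic(selected + [j]), j) for j in remaining)
        if best >= current:    # BIC-based stopping rule: no improvement, stop
            break
        current = best
        selected.append(j_best)
        remaining.remove(j_best)
    return selected

# Toy usage: only the first two of ten predictors matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=200)
print(forward_select_bic(X, y))    # typically [1, 0] or [0, 1]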
The BIC is a special case of the extended BIC (EBIC) when an extra tuning parameter in the latter vanishes. We establish rigorous screening consistency results when either BIC or EBIC is used as the stopping criterion. The theoretical results depend on some conditions on the eigenvalues related to the design matrices, which can be relaxed in some situations. Results of an extensive simulation study and a real data example are also presented to show the efficacy and usefulness of our procedure. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1209-1221 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2015.1080708 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1080708 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1209-1221 Template-Type: ReDIF-Article 1.0 Author-Name: Srijan Sengupta Author-X-Name-First: Srijan Author-X-Name-Last: Sengupta Author-Name: Stanislav Volgushev Author-X-Name-First: Stanislav Author-X-Name-Last: Volgushev Author-Name: Xiaofeng Shao Author-X-Name-First: Xiaofeng Author-X-Name-Last: Shao Title: A Subsampled Double Bootstrap for Massive Data Abstract: The bootstrap is a popular and powerful method for assessing precision of estimators and inferential methods. However, for massive datasets that are increasingly prevalent, the bootstrap becomes prohibitively costly in computation and its feasibility is questionable even with modern parallel computing platforms. Recently, Kleiner and co-authors proposed a method called BLB (bag of little bootstraps) for massive data, which is more computationally scalable with little sacrifice of statistical accuracy. Building on BLB and the idea of fast double bootstrap, we propose a new resampling method, the subsampled double bootstrap, for both independent data and time series data. We establish consistency of the subsampled double bootstrap under mild conditions for both independent and dependent cases. Methodologically, the subsampled double bootstrap is superior to BLB in terms of running time, sample coverage, and automatic implementation with fewer tuning parameters for a given time budget. Its advantage relative to BLB and the bootstrap is also demonstrated in numerical simulations and a data illustration. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1222-1232 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2015.1080709 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1080709 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1222-1232 Template-Type: ReDIF-Article 1.0 Author-Name: Yanxun Xu Author-X-Name-First: Yanxun Author-X-Name-Last: Xu Author-Name: Peter Müller Author-X-Name-First: Peter Author-X-Name-Last: Müller Author-Name: Abdus S. Wahed Author-X-Name-First: Abdus S. Author-X-Name-Last: Wahed Author-Name: Peter F. Thall Author-X-Name-First: Peter F. Author-X-Name-Last: Thall Title: Bayesian Nonparametric Estimation for Dynamic Treatment Regimes With Sequential Transition Times Abstract: We analyze a dataset arising from a clinical trial involving multi-stage chemotherapy regimes for acute leukemia. The trial design was a 2 × 2 factorial for frontline therapies only.
Motivated by the idea that subsequent salvage treatments affect survival time, we model therapy as a dynamic treatment regime (DTR), that is, an alternating sequence of adaptive treatments or other actions and transition times between disease states. These sequences may vary substantially between patients, depending on how the regime plays out. To evaluate the regimes, mean overall survival time is expressed as a weighted average of the means of all possible sums of successive transition times. We assume a Bayesian nonparametric survival regression model for each transition time, with a dependent Dirichlet process prior and Gaussian process base measure (DDP-GP). Posterior simulation is implemented by Markov chain Monte Carlo (MCMC) sampling. We provide general guidelines for constructing a prior using empirical Bayes methods. The proposed approach is compared with inverse probability of treatment weighting, including a doubly robust augmented version of this approach, for both single-stage and multi-stage regimes with treatment assignment depending on baseline covariates. The simulations show that the proposed nonparametric Bayesian approach can substantially improve inference compared to existing methods. An R program for implementing the DDP-GP-based Bayesian nonparametric analysis is freely available at www.ams.jhu.edu/yxu70. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 921-950 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2015.1086353 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1086353 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:921-950 Template-Type: ReDIF-Article 1.0 Author-Name: Ulrich K. Müller Author-X-Name-First: Ulrich K. Author-X-Name-Last: Müller Author-Name: Andriy Norets Author-X-Name-First: Andriy Author-X-Name-Last: Norets Title: Coverage Inducing Priors in Nonstandard Inference Problems Abstract: We consider the construction of set estimators that possess both Bayesian credibility and frequentist coverage properties. We show that under mild regularity conditions there exists a prior distribution that induces (1 − α) frequentist coverage of a (1 − α) credible set. In contrast to the previous literature, this result does not rely on asymptotic normality or invariance, so it can be applied in nonstandard inference problems. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1233-1241 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2015.1086654 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1086654 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1233-1241 Template-Type: ReDIF-Article 1.0 Author-Name: Junhui Wang Author-X-Name-First: Junhui Author-X-Name-Last: Wang Author-Name: Xiaotong Shen Author-X-Name-First: Xiaotong Author-X-Name-Last: Shen Author-Name: Yiwen Sun Author-X-Name-First: Yiwen Author-X-Name-Last: Sun Author-Name: Annie Qu Author-X-Name-First: Annie Author-X-Name-Last: Qu Title: Classification With Unstructured Predictors and an Application to Sentiment Analysis Abstract: Unstructured data refer to information that lacks certain structures and cannot be organized in a predefined fashion.
Unstructured data often involve words, texts, graphs, objects, or multimedia types of files that are difficult to process and analyze with traditional computational tools and statistical methods. This work explores ordinal classification for unstructured predictors with ordered class categories, where imprecise information concerning strengths of association between predictors is available for predicting class labels. Here, imprecise information is expressed in terms of a directed graph, with each node representing a predictor and a directed edge containing pairwise strengths of association between two nodes. One of the targeted applications for unstructured data arises from sentiment analysis, which identifies and extracts the relevant content or opinion of a document concerning a specific event of interest. We integrate the imprecise predictor relations into linear relational constraints over classification function coefficients, where large margin ordinal classifiers are introduced, subject to many quadratically linear constraints. The proposed classifiers are then applied in sentiment analysis using binary word predictors. Computationally, we implement ordinal support vector machines and ψ-learning through a scalable quadratic programming package based on sparse word representations. Theoretically, we show that using relationships among unstructured predictors improves prediction accuracy of classification significantly. We illustrate an application for sentiment analysis using consumer text reviews and movie review data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1242-1253 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2015.1089771 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1089771 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1242-1253 Template-Type: ReDIF-Article 1.0 Author-Name: Will Wei Sun Author-X-Name-First: Will Wei Author-X-Name-Last: Sun Author-Name: Xingye Qiao Author-X-Name-First: Xingye Author-X-Name-Last: Qiao Author-Name: Guang Cheng Author-X-Name-First: Guang Author-X-Name-Last: Cheng Title: Stabilized Nearest Neighbor Classifier and its Statistical Properties Abstract: The stability of statistical analysis is an important indicator for reproducibility, which is one main principle of the scientific method. It entails that similar statistical conclusions can be reached based on independent samples from the same underlying population. In this article, we introduce a general measure of classification instability (CIS) to quantify the sampling variability of the prediction made by a classification method. Interestingly, the asymptotic CIS of any weighted nearest neighbor classifier turns out to be proportional to the Euclidean norm of its weight vector. Based on this concise form, we propose a stabilized nearest neighbor (SNN) classifier, which distinguishes itself from other nearest neighbor classifiers by taking stability into consideration. In theory, we prove that SNN attains the minimax optimal convergence rate in risk, and a sharp convergence rate in CIS. The latter rate result is established for general plug-in classifiers under a low-noise condition. Extensive simulated and real examples demonstrate that SNN achieves a considerable improvement in CIS over existing nearest neighbor classifiers, with comparable classification accuracy.
We implement the algorithm in a publicly available R package snn. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1254-1265 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2015.1089772 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1089772 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1254-1265 Template-Type: ReDIF-Article 1.0 Author-Name: Emre Barut Author-X-Name-First: Emre Author-X-Name-Last: Barut Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Anneleen Verhasselt Author-X-Name-First: Anneleen Author-X-Name-Last: Verhasselt Title: Conditional Sure Independence Screening Abstract: Independence screening is powerful for variable selection when the number of variables is massive. Commonly used independence screening methods are based on marginal correlations or their variants. When some prior knowledge on a certain important set of variables is available, a natural assessment of the relative importance of the other predictors is their conditional contributions to the response given the known set of variables. This results in conditional sure independence screening (CSIS). CSIS produces a rich family of alternative screening methods by different choices of the conditioning set and can help reduce the number of false positive and false negative selections when covariates are highly correlated. This article proposes and studies CSIS in generalized linear models. We give conditions under which sure screening is possible and derive an upper bound on the number of selected variables. We also spell out the situation under which CSIS yields model selection consistency and the properties of CSIS when a data-driven conditioning set is used. Moreover, we provide two data-driven methods to select the thresholding parameter of conditional screening. The utility of the procedure is illustrated by simulation studies and analysis of two real datasets. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1266-1277 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2015.1092974 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1092974 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1266-1277 Template-Type: ReDIF-Article 1.0 Author-Name: Christopher J. Bennett Author-X-Name-First: Christopher J. Author-X-Name-Last: Bennett Author-Name: Brennan S. Thompson Author-X-Name-First: Brennan S. Author-X-Name-Last: Thompson Title: Graphical Procedures for Multiple Comparisons Under General Dependence Abstract: It has been more than half a century since Tukey first introduced graphical displays that relate nonoverlap of confidence intervals to statistically significant differences between parameter estimates. In this article, we show how Tukey’s graphical overlap procedure can be modified to accommodate general forms of dependence within and across samples.
We also develop a procedure that can be used to more effectively resolve rankings within the tails of the distributions of parameter values, thereby generalizing existing methods for “multiple comparisons with the best.” We show that these new procedures retain the simplicity of Tukey’s original procedure, while maintaining asymptotic control of the familywise error rate under very general conditions. Simple examples are used throughout to illustrate the procedures. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1278-1288 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2015.1093941 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1093941 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1278-1288 Template-Type: ReDIF-Article 1.0 Author-Name: Gang Li Author-X-Name-First: Gang Author-X-Name-Last: Li Author-Name: Qing Yang Author-X-Name-First: Qing Author-X-Name-Last: Yang Title: Joint Inference for Competing Risks Survival Data Abstract: This article develops joint inferential methods for the cause-specific hazard function and the cumulative incidence function of a specific type of failure to assess the effects of a variable on the time to the type of failure of interest in the presence of competing risks. Joint inference for the two functions is needed in practice because (i) they describe different characteristics of a given type of failure, (ii) they do not uniquely determine each other, and (iii) the effects of a variable on the two functions can be different and one often does not know which effects are to be expected. We study both the group comparison problem and the regression problem. We also discuss joint inference for other related functions. Our simulation shows that our joint tests can be considerably more powerful than the Bonferroni method, which has important practical implications for the analysis and design of clinical studies with competing risks data. We illustrate our method using Hodgkin disease data and lymphoma data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1289-1300 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2015.1093942 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1093942 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1289-1300 Template-Type: ReDIF-Article 1.0 Author-Name: Samiran Sinha Author-X-Name-First: Samiran Author-X-Name-Last: Sinha Author-Name: Yanyuan Ma Author-X-Name-First: Yanyuan Author-X-Name-Last: Ma Title: Analysis of Proportional Odds Models With Censoring and Errors-in-Covariates Abstract: We propose a consistent method for estimating both the finite- and infinite-dimensional parameters of the proportional odds model when a covariate is subject to measurement error and time-to-events are subject to right censoring. The proposed method does not rely on distributional assumptions about the true covariate, which is not observed in the data. In addition, the proposed estimator does not require the measurement error to be normally distributed or to have any other specific distribution, and we do not attempt to assess the error distribution.
Instead, we construct martingale-based estimators through inversion, using only the moment properties of the error distribution, estimable from multiple erroneous measurements of the true covariate. The theoretical properties of the estimators are established and the finite sample performance is demonstrated via simulations. We illustrate the usefulness of the method by analyzing a dataset from a clinical study on AIDS. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1301-1312 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2015.1093943 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1093943 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1301-1312 Template-Type: ReDIF-Article 1.0 Author-Name: Efstathia Bura Author-X-Name-First: Efstathia Author-X-Name-Last: Bura Author-Name: Sabrina Duarte Author-X-Name-First: Sabrina Author-X-Name-Last: Duarte Author-Name: Liliana Forzani Author-X-Name-First: Liliana Author-X-Name-Last: Forzani Title: Sufficient Reductions in Regressions With Exponential Family Inverse Predictors Abstract: We develop methodology for identifying and estimating sufficient reductions in regressions with predictors that, given the response, follow a multivariate exponential family distribution. This setup includes regressions where predictors are all continuous, all categorical, or mixtures of categorical and continuous. We derive the minimal sufficient reduction of the predictors and its maximum likelihood estimator by modeling the conditional distribution of the predictors given the response. Whereas nearly all extant estimators of sufficient reductions are linear and only partly capture the sufficient reduction, our method is not limited to linear reductions. It also provides the exact form of the sufficient reduction, which is exhaustive, its maximum likelihood (ML) estimates via an iteratively reweighted least squares (IRLS) estimation algorithm, and asymptotic tests for the dimension of the regression. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1313-1329 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2015.1093944 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1093944 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1313-1329 Template-Type: ReDIF-Article 1.0 Author-Name: Carsten Jentsch Author-X-Name-First: Carsten Author-X-Name-Last: Jentsch Author-Name: Claudia Kirch Author-X-Name-First: Claudia Author-X-Name-Last: Kirch Title: How Much Information Does Dependence Between Wavelet Coefficients Contain? Abstract: This article is motivated by several articles that propose statistical inference assuming independence of wavelet coefficients for both short- as well as long-range dependent time series. We focus on the sample variance and investigate the influence of the dependence between wavelet coefficients on this statistic. To this end, we derive asymptotic distributional properties of the sample variance for a time series that is synthesized, ignoring some or all dependence between wavelet coefficients.
We show that the second-order properties differ from those of the true time series whose wavelet coefficients have the same marginal distribution, except in the independent Gaussian case. This holds true even if the dependency is correct within each level and only the dependence between levels is ignored. In the case of sample autocovariances and sample autocorrelations at lag one, we indicate that first-order properties are erroneous. In a second step, we investigate several nonparametric bootstrap schemes in the wavelet domain, which take more and more dependence into account until finally the full dependency is mimicked. We obtain very similar results: only a bootstrap that correctly mimics the full covariance structure can be asymptotically valid. A simulation study supports our theoretical findings for the wavelet domain bootstraps. For long-range-dependent time series with long-memory parameter d > 1/4, we show that some additional problems occur, which cannot be solved easily without using additional information for the bootstrap. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1330-1345 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2015.1093945 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1093945 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1330-1345 Template-Type: ReDIF-Article 1.0 Author-Name: Paula Moraga Author-X-Name-First: Paula Author-X-Name-Last: Moraga Title: Comment Journal: Journal of the American Statistical Association Pages: 1110-1111 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2015.1116989 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1116989 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1110-1111 Template-Type: ReDIF-Article 1.0 Author-Name: Peter J. Diggle Author-X-Name-First: Peter J. Author-X-Name-Last: Diggle Author-Name: Emanuele Giorgi Author-X-Name-First: Emanuele Author-X-Name-Last: Giorgi Title: Model-Based Geostatistics for Prevalence Mapping in Low-Resource Settings Abstract: In low-resource settings, prevalence mapping relies on empirical prevalence data from a finite, often spatially sparse, set of surveys of communities within the region of interest, possibly supplemented by remotely sensed images that can act as proxies for environmental risk factors. A standard geostatistical model for data of this kind is a generalized linear mixed model with binomial error distribution, logistic link, and a combination of explanatory variables and a Gaussian spatial stochastic process in the linear predictor; one common way to write this model is sketched below. In this article, we first review statistical methods and software associated with this standard model, then consider several methodological extensions whose development has been motivated by the requirements of specific applications. These include: methods for combining randomized survey data with data from nonrandomized, and therefore potentially biased, surveys; spatio-temporal extensions; and spatially structured zero-inflation. Throughout, we illustrate the methods with disease mapping applications that have arisen through our involvement with a range of African public health programs.
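For reference, that standard model can be written (in our notation, assumed from the abstract's description) as
\[
Y_i \mid S \sim \mathrm{Binomial}\{m_i,\, p(x_i)\}, \qquad \log\left\{\frac{p(x_i)}{1 - p(x_i)}\right\} = d(x_i)^{\top}\beta + S(x_i),
\]
where $Y_i$ is the number of positive outcomes among the $m_i$ individuals sampled at community location $x_i$, $d(x_i)$ is a vector of explanatory variables (possibly remotely sensed proxies), and $S(\cdot)$ is a Gaussian spatial stochastic process.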
Journal: Journal of the American Statistical Association Pages: 1096-1120 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2015.1123158 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1123158 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1096-1120 Template-Type: ReDIF-Article 1.0 Author-Name: Yang Chen Author-X-Name-First: Yang Author-X-Name-Last: Chen Author-Name: Kuang Shen Author-X-Name-First: Kuang Author-X-Name-Last: Shen Author-Name: Shu-Ou Shan Author-X-Name-First: Shu-Ou Author-X-Name-Last: Shan Author-Name: S. C. Kou Author-X-Name-First: S. C. Author-X-Name-Last: Kou Title: Analyzing Single-Molecule Protein Transportation Experiments via Hierarchical Hidden Markov Models Abstract: To maintain proper cellular functions, over 50% of proteins encoded in the genome need to be transported to cellular membranes. The molecular mechanism behind such a process, often referred to as protein targeting, is not well understood. Single-molecule experiments are designed to unveil the detailed mechanisms and reveal the functions of different molecular machineries involved in the process. The experimental data consist of hundreds of stochastic time traces from the fluorescence recordings of the experimental system. We introduce a Bayesian hierarchical model on top of hidden Markov models (HMMs) to analyze these data and use the statistical results to answer the biological questions. In addition to resolving the biological puzzles and delineating the regulating roles of different molecular complexes, our statistical results enable us to propose a more detailed mechanism for the late stages of the protein targeting process. Journal: Journal of the American Statistical Association Pages: 951-966 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2016.1140050 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1140050 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:951-966 Template-Type: ReDIF-Article 1.0 Author-Name: Alexander W. Blocker Author-X-Name-First: Alexander W. Author-X-Name-Last: Blocker Author-Name: Edoardo M. Airoldi Author-X-Name-First: Edoardo M. Author-X-Name-Last: Airoldi Title: Template-Based Models for Genome-Wide Analysis of Next-Generation Sequencing Data at Base-Pair Resolution Abstract: We consider the problem of estimating the genome-wide distribution of nucleosome positions from paired-end sequencing data. We develop a modeling approach based on nonparametric templates to control for the variability along the sequence of read counts associated with nucleosomal DNA due to enzymatic digestion and other sample preparation steps, and we develop a calibrated Bayesian method to detect local concentrations of nucleosome positions. We also introduce a set of estimands that provides rich, interpretable summaries of nucleosome positioning. Inference is carried out via a distributed Hamiltonian Monte Carlo algorithm that can scale linearly with the length of the genome being analyzed. We provide MPI-based Python implementations of the proposed methods, stand-alone and on Amazon EC2, which can provide inferences on an entire Saccharomyces cerevisiae genome in less than 1 hr on EC2. We evaluate the accuracy and reproducibility of the inferences leveraging a factorially designed simulation study and experimental replicates. 
The template-based approach we develop here is also applicable to single-end sequencing data by using alternative sources of fragment length information, and to ordered and sequential data more generally. It provides a flexible and scalable alternative to mixture models, hidden Markov models, and Parzen-window methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 967-987 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2016.1141095 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1141095 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:967-987 Template-Type: ReDIF-Article 1.0 Author-Name: Margaret E. Roberts Author-X-Name-First: Margaret E. Author-X-Name-Last: Roberts Author-Name: Brandon M. Stewart Author-X-Name-First: Brandon M. Author-X-Name-Last: Stewart Author-Name: Edoardo M. Airoldi Author-X-Name-First: Edoardo M. Author-X-Name-Last: Airoldi Title: A Model of Text for Experimentation in the Social Sciences Abstract: Statistical models of text have become increasingly popular in statistics and computer science as a method of exploring large document collections. Social scientists often want to move beyond exploration, to measurement and experimentation, and make inference about social and political processes that drive discourse and content. In this article, we develop a model of text data that supports this type of substantive research. Our approach is to posit a hierarchical mixed membership model for analyzing topical content of documents, in which mixing weights are parameterized by observed covariates. In this model, topical prevalence and topical content are specified as a simple generalized linear model on an arbitrary number of document-level covariates, such as news source and time of release, enabling researchers to introduce elements of the experimental design that informed document collection into the model, within a generally applicable framework. We demonstrate the proposed methodology by analyzing a collection of news reports about China, where we allow the prevalence of topics to evolve over time and vary across newswire services. Our methods quantify the effect of newswire source on both the frequency and nature of topic coverage. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 988-1003 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2016.1141684 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1141684 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:988-1003 Template-Type: ReDIF-Article 1.0 Author-Name: Sung Won Han Author-X-Name-First: Sung Won Author-X-Name-Last: Han Author-Name: Gong Chen Author-X-Name-First: Gong Author-X-Name-Last: Chen Author-Name: Myun-Seok Cheon Author-X-Name-First: Myun-Seok Author-X-Name-Last: Cheon Author-Name: Hua Zhong Author-X-Name-First: Hua Author-X-Name-Last: Zhong Title: Estimation of Directed Acyclic Graphs Through Two-Stage Adaptive Lasso for Gene Network Inference Abstract: Graphical models are a popular approach to find dependence and conditional independence relationships between gene expressions. Directed acyclic graphs (DAGs) are a special class of directed graphical models, where all the edges are directed edges and contain no directed cycles.
DAGs are well-known models for discovering causal relationships between genes in gene regulatory networks. However, estimating DAGs without assuming a known ordering is challenging due to high dimensionality, the acyclicity constraints, and the presence of equivalence classes in observational data. To overcome these challenges, we propose a two-stage adaptive Lasso approach, called NS-DIST, which performs neighborhood selection (NS) in stage 1, and then estimates DAGs by the discrete improving search with Tabu (DIST) algorithm within the selected neighborhood. Simulation studies are presented to demonstrate the effectiveness of the method and its computational efficiency. Two real data examples are used to demonstrate the practical usage of our method for gene regulatory network inference. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1004-1019 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2016.1142880 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1142880 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1004-1019 Template-Type: ReDIF-Article 1.0 Author-Name: Shahin Tavakoli Author-X-Name-First: Shahin Author-X-Name-Last: Tavakoli Author-Name: Victor M. Panaretos Author-X-Name-First: Victor M. Author-X-Name-Last: Panaretos Title: Detecting and Localizing Differences in Functional Time Series Dynamics: A Case Study in Molecular Biophysics Abstract: Motivated by the problem of inferring the molecular dynamics of DNA in solution, and linking them with its base-pair composition, we consider the problem of comparing the dynamics of functional time series (FTS), and of localizing any inferred differences in frequency and along curvelength. The approach we take is one of Fourier analysis, where the complete second-order structure of the FTS is encoded by its spectral density operator, indexed by frequency and curvelength. The comparison is broken down into a hierarchy of stages: at a global level, we compare the spectral density operators of the two FTS, across frequencies and curvelength, based on a Hilbert–Schmidt criterion; then, we localize any differences to specific frequencies; and, finally, we further localize any differences along the length of the random curves, that is, in physical space. A hierarchical multiple testing approach guarantees control of the averaged false discovery rate over the selected frequencies. In this sense, we are able to attribute any differences to distinct dynamic (frequency) and spatial (curvelength) contributions. Our approach is presented and illustrated by means of a case study in molecular biophysics: how can one use molecular dynamics simulations of short strands of DNA to infer their temporal dynamics at the scaling limit, and probe whether these depend on the sequence encoded in these strands? Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1020-1035 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2016.1147355 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1147355 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1020-1035 Template-Type: ReDIF-Article 1.0 Author-Name: Tyler H. McCormick Author-X-Name-First: Tyler H.
Author-X-Name-Last: McCormick Author-Name: Zehang Richard Li Author-X-Name-First: Zehang Richard Author-X-Name-Last: Li Author-Name: Clara Calvert Author-X-Name-First: Clara Author-X-Name-Last: Calvert Author-Name: Amelia C. Crampin Author-X-Name-First: Amelia C. Author-X-Name-Last: Crampin Author-Name: Kathleen Kahn Author-X-Name-First: Kathleen Author-X-Name-Last: Kahn Author-Name: Samuel J. Clark Author-X-Name-First: Samuel J. Author-X-Name-Last: Clark Title: Probabilistic Cause-of-Death Assignment Using Verbal Autopsies Abstract: In regions without complete-coverage civil registration and vital statistics systems, there is uncertainty about even the most basic demographic indicators. In such regions, the majority of deaths occur outside hospitals and are not recorded. Worldwide, fewer than one-third of deaths are assigned a cause, with the least information available from the most impoverished nations. In populations like this, verbal autopsy (VA) is a commonly used tool to assess cause of death and estimate cause-specific mortality rates and the distribution of deaths by cause. VA uses an interview with caregivers of the decedent to elicit data describing the signs and symptoms leading up to the death. This article develops a new statistical tool known as InSilicoVA to classify cause of death using information acquired through VA. InSilicoVA shares uncertainty between cause of death assignments for specific individuals and the distribution of deaths by cause across the population. Using side-by-side comparisons with both observed and simulated data, we demonstrate that InSilicoVA has distinct advantages compared to currently available methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1036-1049 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2016.1152191 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1152191 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1036-1049 Template-Type: ReDIF-Article 1.0 Author-Name: Chen Yue Author-X-Name-First: Chen Author-X-Name-Last: Yue Author-Name: Vadim Zipunnikov Author-X-Name-First: Vadim Author-X-Name-Last: Zipunnikov Author-Name: Pierre-Louis Bazin Author-X-Name-First: Pierre-Louis Author-X-Name-Last: Bazin Author-Name: Dzung Pham Author-X-Name-First: Dzung Author-X-Name-Last: Pham Author-Name: Daniel Reich Author-X-Name-First: Daniel Author-X-Name-Last: Reich Author-Name: Ciprian Crainiceanu Author-X-Name-First: Ciprian Author-X-Name-Last: Crainiceanu Author-Name: Brian Caffo Author-X-Name-First: Brian Author-X-Name-Last: Caffo Title: Parameterization of White Matter Manifold-Like Structures Using Principal Surfaces Abstract: In this article, we are concerned with data generated from a diffusion tensor imaging (DTI) experiment. The goal is to parameterize manifold-like white matter tracts, such as the corpus callosum, using principal surfaces. The problem is approached by finding a geometrically motivated surface-based representation of the corpus callosum and visualizing fractional anisotropy (FA) values projected onto the surface. The method also applies to any other diffusion summary. An algorithm is proposed that (a) constructs the principal surface of a corpus callosum; (b) flattens the surface into a parametric two-dimensional (2D) map; and (c) projects associated FA values on the map.
The algorithm is applied to a longitudinal study containing 466 diffusion tensor images of 176 multiple sclerosis (MS) patients observed at multiple visits. For each subject and visit, the study contains a registered DTI scan of the corpus callosum at roughly 20,000 voxels. Extensive simulation studies demonstrate fast convergence and robust performance of the algorithm under a variety of challenging scenarios. Journal: Journal of the American Statistical Association Pages: 1050-1060 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2016.1164050 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1164050 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1050-1060 Template-Type: ReDIF-Article 1.0 Author-Name: Kyu Ha Lee Author-X-Name-First: Kyu Ha Author-X-Name-Last: Lee Author-Name: Francesca Dominici Author-X-Name-First: Francesca Author-X-Name-Last: Dominici Author-Name: Deborah Schrag Author-X-Name-First: Deborah Author-X-Name-Last: Schrag Author-Name: Sebastien Haneuse Author-X-Name-First: Sebastien Author-X-Name-Last: Haneuse Title: Hierarchical Models for Semicompeting Risks Data With Application to Quality of End-of-Life Care for Pancreatic Cancer Abstract: Readmission following discharge from an initial hospitalization is a key marker of quality of healthcare in the United States. For the most part, readmission has been studied among patients with “acute” health conditions, such as pneumonia and heart failure, with analyses based on a logistic-Normal generalized linear mixed model. Naïve application of this model to the study of readmission among patients with “advanced” health conditions such as pancreatic cancer, however, is problematic because it ignores death as a competing risk. A more appropriate analysis is to imbed such a study within the semicompeting risks framework. To our knowledge, however, no comprehensive statistical methods have been developed for cluster-correlated semicompeting risks data. To resolve this gap in the literature, we propose a novel hierarchical modeling framework for the analysis of cluster-correlated semicompeting risks data that permits parametric or nonparametric specifications for a range of components, giving analysts substantial flexibility as they consider their own analyses. Estimation and inference are performed within the Bayesian paradigm since it facilitates the straightforward characterization of (posterior) uncertainty for all model parameters, including hospital-specific random effects. Model comparison and choice are performed via the deviance information criterion and the log-pseudo marginal likelihood statistic, both of which are based on a partially marginalized likelihood. An efficient computational scheme, based on the Metropolis-Hastings-Green algorithm, is developed and has been implemented in the R package SemiCompRisks. A comprehensive simulation study shows that the proposed framework performs very well in a range of data scenarios, and outperforms competitor analysis strategies. The proposed framework is motivated by and illustrated with an ongoing study of the risk of readmission among Medicare beneficiaries diagnosed with pancreatic cancer.
Using data on n = 5298 patients at J = 112 hospitals in the six New England states from 2000 to 2009, key scientific questions we consider include the role of patient-level risk factors in the risk of readmission and the extent of variation in risk across hospitals not explained by differences in patient case-mix. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1075-1095 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2016.1164052 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1164052 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1075-1095 Template-Type: ReDIF-Article 1.0 Author-Name: Leonhard Held Author-X-Name-First: Leonhard Author-X-Name-Last: Held Author-Name: Stefanie Muff Author-X-Name-First: Stefanie Author-X-Name-Last: Muff Title: Comment Journal: Journal of the American Statistical Association Pages: 1108-1110 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2016.1164705 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1164705 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1108-1110 Template-Type: ReDIF-Article 1.0 Author-Name: Jan Hannig Author-X-Name-First: Jan Author-X-Name-Last: Hannig Author-Name: Hari Iyer Author-X-Name-First: Hari Author-X-Name-Last: Iyer Author-Name: Randy C. S. Lai Author-X-Name-First: Randy C. S. Author-X-Name-Last: Lai Author-Name: Thomas C. M. Lee Author-X-Name-First: Thomas C. M. Author-X-Name-Last: Lee Title: Generalized Fiducial Inference: A Review and New Results Abstract: R. A. Fisher, the father of modern statistics, proposed the idea of fiducial inference during the first half of the 20th century. While his proposal led to interesting methods for quantifying uncertainty, other prominent statisticians of the time did not accept Fisher’s approach, as it became apparent that some of Fisher’s bold claims about the properties of the fiducial distribution did not hold up for multi-parameter problems. Beginning around the year 2000, the authors and collaborators started to reinvestigate the idea of fiducial inference and discovered that Fisher’s approach, when properly generalized, would open doors to solve many important and difficult inference problems. They termed their generalization of Fisher’s idea generalized fiducial inference (GFI). The main idea of GFI is to carefully transfer randomness from the data to the parameter space using an inverse of a data-generating equation without the use of Bayes’ theorem. The resulting generalized fiducial distribution (GFD) can then be used for inference. After more than a decade of investigations, the authors and collaborators have developed a unifying theory for GFI, and provided GFI solutions to many challenging practical problems in different fields of science and industry. Overall, they have demonstrated that GFI is a valid, useful, and promising approach for conducting statistical inference. The goal of this article is to deliver a timely and concise introduction to GFI, to present some of the latest results, as well as to list some related open research problems. It is the authors’ hope that their contributions to GFI will stimulate the growth and usage of this exciting approach for statistical inference. Supplementary materials for this article are available online.
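The idea of inverting a data-generating equation, central to the abstract above, is most transparent in the canonical one-observation normal example; the derivation below is that textbook illustration rather than a result specific to this article.

```latex
% Data-generating equation Y = \mu + Z with Z ~ N(0,1). Observing y,
% solve for the parameter: \mu = y - Z. Substituting an independent
% copy Z^* transfers the randomness from the data to the parameter
% space, giving the generalized fiducial distribution
\[
  \mu \mid y \;\stackrel{d}{=}\; y - Z^*, \quad Z^* \sim N(0,1),
  \qquad\text{i.e.}\qquad \mu \mid y \sim N(y, 1),
\]
% obtained without a prior and without Bayes' theorem.
```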
Journal: Journal of the American Statistical Association Pages: 1346-1361 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2016.1165102 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1165102 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1346-1361 Template-Type: ReDIF-Article 1.0 Author-Name: Fiona Steele Author-X-Name-First: Fiona Author-X-Name-Last: Steele Author-Name: Elizabeth Washbrook Author-X-Name-First: Elizabeth Author-X-Name-Last: Washbrook Author-Name: Christopher Charlton Author-X-Name-First: Christopher Author-X-Name-Last: Charlton Author-Name: William J. Browne Author-X-Name-First: William J. Author-X-Name-Last: Browne Title: A Longitudinal Mixed Logit Model for Estimation of Push and Pull Effects in Residential Location Choice Abstract: We develop a random effects discrete choice model for the analysis of households’ choice of neighborhood over time. The model is parameterized in a way that exploits longitudinal data to separate the influence of neighborhood characteristics on the decision to move out of the current area (“push” effects) and on the choice of one destination over another (“pull” effects). Random effects are included to allow for unobserved heterogeneity between households in their propensity to move, and in the importance placed on area characteristics. The model also includes area-level random effects. The combination of a large choice set, large sample size, and repeated observations means that existing estimation approaches are often infeasible. We, therefore, propose an efficient MCMC algorithm for the analysis of large-scale datasets. The model is applied in an analysis of residential choice in England using data from the British Household Panel Survey linked to neighborhood-level census data. We consider how effects of area deprivation and distance from the current area depend on household characteristics and life course transitions in the previous year. We find substantial differences between households in the effects of deprivation on out-mobility and selection of destination, with evidence of severely constrained choices among less-advantaged households. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1061-1074 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2016.1180984 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1180984 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1061-1074 Template-Type: ReDIF-Article 1.0 Author-Name: Qian Guan Author-X-Name-First: Qian Author-X-Name-Last: Guan Author-Name: Eric B. Laber Author-X-Name-First: Eric B. Author-X-Name-Last: Laber Author-Name: Brian J. Reich Author-X-Name-First: Brian J. Author-X-Name-Last: Reich Title: Comment Journal: Journal of the American Statistical Association Pages: 936-942 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2016.1200911 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1200911 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:936-942 Template-Type: ReDIF-Article 1.0 Author-Name: Jingxiang Chen Author-X-Name-First: Jingxiang Author-X-Name-Last: Chen Author-Name: Yufeng Liu Author-X-Name-First: Yufeng Author-X-Name-Last: Liu Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: Rui Song Author-X-Name-First: Rui Author-X-Name-Last: Song Author-Name: Yingqi Zhao Author-X-Name-First: Yingqi Author-X-Name-Last: Zhao Author-Name: Michael R. Kosorok Author-X-Name-First: Michael R. Author-X-Name-Last: Kosorok Title: Comment Abstract: Xu, Müller, Wahed, and Thall proposed a Bayesian model to analyze an acute leukemia study involving multi-stage chemotherapy regimes. We discuss two alternative methods, Q-learning and O-learning, to solve the same problem from the machine learning point of view. The numerical studies show that these methods can be flexible and, in some situations, have advantages in handling treatment heterogeneity while being robust to model misspecification. Journal: Journal of the American Statistical Association Pages: 942-947 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2016.1200914 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1200914 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:942-947 Template-Type: ReDIF-Article 1.0 Author-Name: Lorenzo Trippa Author-X-Name-First: Lorenzo Author-X-Name-Last: Trippa Author-Name: Giovanni Parmigiani Author-X-Name-First: Giovanni Author-X-Name-Last: Parmigiani Title: Comment Journal: Journal of the American Statistical Association Pages: 947-948 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2016.1200915 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1200915 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:947-948 Template-Type: ReDIF-Article 1.0 Author-Name: Yanxun Xu Author-X-Name-First: Yanxun Author-X-Name-Last: Xu Author-Name: Peter Müller Author-X-Name-First: Peter Author-X-Name-Last: Müller Author-Name: Abdus S. Wahed Author-X-Name-First: Abdus S. Author-X-Name-Last: Wahed Author-Name: Peter Thall Author-X-Name-First: Peter Author-X-Name-Last: Thall Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 948-950 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2016.1200917 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1200917 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:948-950 Template-Type: ReDIF-Article 1.0 Author-Name: Jon Wakefield Author-X-Name-First: Jon Author-X-Name-Last: Wakefield Author-Name: Daniel Simpson Author-X-Name-First: Daniel Author-X-Name-Last: Simpson Author-Name: Jessica Godwin Author-X-Name-First: Jessica Author-X-Name-Last: Godwin Title: Comment: Getting into Space with a Weight Problem Journal: Journal of the American Statistical Association Pages: 1111-1118 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2016.1200918 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1200918 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1111-1118 Template-Type: ReDIF-Article 1.0 Author-Name: Peter J. Diggle Author-X-Name-First: Peter J.
Author-X-Name-Last: Diggle Author-Name: Emanuele Giorgi Author-X-Name-First: Emanuele Author-X-Name-Last: Giorgi Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 1119-1120 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2016.1200919 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1200919 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1119-1120 Template-Type: ReDIF-Article 1.0 Author-Name: Wouter Duivesteijn Author-X-Name-First: Wouter Author-X-Name-Last: Duivesteijn Title: Correction to Jin-Ting Zhang’s “Approximate and Asymptotic Distributions of Chi-Squared-Type Mixtures With Applications” Abstract: Zhang derives approximations for the distribution of a mixture of chi-squared distributions. The two derived approximation bounds in Theorem 1.1 both contain an arithmetic error. These are corrected here. Journal: Journal of the American Statistical Association Pages: 1370-1371 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2016.1200980 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1200980 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1370-1371 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Book Reviews Journal: Journal of the American Statistical Association Pages: 1362-1369 Issue: 515 Volume: 111 Year: 2016 Month: 7 X-DOI: 10.1080/01621459.2016.1235436 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1235436 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1362-1369 Template-Type: ReDIF-Article 1.0 Author-Name: Ning Hao Author-X-Name-First: Ning Author-X-Name-Last: Hao Author-Name: Yang Feng Author-X-Name-First: Yang Author-X-Name-Last: Feng Author-Name: Hao Helen Zhang Author-X-Name-First: Hao Helen Author-X-Name-Last: Zhang Title: Model Selection for High-Dimensional Quadratic Regression via Regularization Abstract: Quadratic regression (QR) models naturally extend linear models by considering interaction effects between the covariates. To conduct model selection in QR, it is important to maintain the hierarchical model structure between main effects and interaction effects. Existing regularization methods generally achieve this goal by solving complex optimization problems, which usually demand a high computational cost and hence are not feasible for high-dimensional data. This article focuses on scalable regularization methods for model selection in high-dimensional QR. We first consider two-stage regularization methods and establish theoretical properties of the two-stage LASSO. Then, a new regularization method, called regularization algorithm under marginality principle (RAMP), is proposed to compute a hierarchy-preserving regularization solution path efficiently. Both methods are further extended to solve generalized QR models. Numerical results are also shown to demonstrate the performance of the methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 615-625 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2016.1264956 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1264956 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
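The two-stage scheme described in the Hao, Feng, and Zhang abstract (screen main effects first, then refit with interactions built only from the survivors, so the main-effect/interaction hierarchy is respected) admits a compact sketch. In the Python sketch below, the function name, the cross-validated tuning, and the weak-hierarchy rule are illustrative assumptions, not the article's exact procedure.

```python
import numpy as np
from itertools import combinations_with_replacement
from sklearn.linear_model import LassoCV

def two_stage_lasso(X, y):
    """Illustrative two-stage LASSO for quadratic regression: screen
    main effects first, then refit with quadratic/interaction terms
    built only from the stage-1 survivors, preserving hierarchy."""
    fit1 = LassoCV(cv=5).fit(X, y)            # stage 1: main effects only
    sel = np.flatnonzero(fit1.coef_)
    if sel.size == 0:
        return sel, fit1                      # nothing survives screening
    inter = np.column_stack([X[:, j] * X[:, k]
                             for j, k in combinations_with_replacement(sel, 2)])
    X2 = np.column_stack([X[:, sel], inter])  # selected mains + their products
    fit2 = LassoCV(cv=5).fit(X2, y)           # stage 2: hierarchical refit
    return sel, fit2
```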
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:615-625 Template-Type: ReDIF-Article 1.0 Author-Name: Antonio R. Linero Author-X-Name-First: Antonio R. Author-X-Name-Last: Linero Title: Bayesian Regression Trees for High-Dimensional Prediction and Variable Selection Abstract: Decision tree ensembles are an extremely popular tool for obtaining high-quality predictions in nonparametric regression problems. Unmodified, however, many commonly used decision tree ensemble methods do not adapt to sparsity in the regime in which the number of predictors is larger than the number of observations. A recent stream of research concerns the construction of decision tree ensembles that are motivated by a generative probabilistic model, the most influential method being the Bayesian additive regression trees (BART) framework. In this article, we take a Bayesian point of view on this problem and show how to construct priors on decision tree ensembles that are capable of adapting to sparsity in the predictors by placing a sparsity-inducing Dirichlet hyperprior on the splitting proportions of the regression tree prior. We characterize the asymptotic distribution of the number of predictors included in the model and show how this prior can be easily incorporated into existing Markov chain Monte Carlo schemes. We demonstrate that our approach yields useful posterior inclusion probabilities for each predictor and illustrate the usefulness of our approach relative to other decision tree ensemble approaches on both simulated and real datasets. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 626-636 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2016.1264957 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1264957 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:626-636 Template-Type: ReDIF-Article 1.0 Author-Name: Ting Zhang Author-X-Name-First: Ting Author-X-Name-Last: Zhang Author-Name: Liliya Lavitas Author-X-Name-First: Liliya Author-X-Name-Last: Lavitas Title: Unsupervised Self-Normalized Change-Point Testing for Time Series Abstract: We propose a new self-normalized method for testing change points in the time series setting. Self-normalization has been celebrated for its ability to avoid direct estimation of the nuisance asymptotic variance and its flexibility of being generalized to handle quantities other than the mean. However, it was developed and mainly studied for constructing confidence intervals for quantities associated with a stationary time series, and its adaptation to change-point testing can be nontrivial as direct implementation can lead to tests with nonmonotonic power. Compared with existing results on using self-normalization in this direction, the current article proposes a new self-normalized change-point test that does not require prespecifying the number of total change points and is thus unsupervised. In addition, we propose a new contrast-based approach in generalizing self-normalized statistics to handle quantities other than the mean, which is specifically tailored for change-point testing. Monte Carlo simulations are presented to illustrate the finite-sample performance of the proposed method. Supplementary materials for this article are available online. 
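As background for the self-normalization discussed in the Zhang and Lavitas abstract, the sketch below computes the classical single change-in-mean self-normalized statistic in the style of Shao and Zhang (2010), sup_k D_n(k)^2 / V_n(k), which avoids estimating the long-run variance; it is the supervised precursor that the article generalizes, not the paper's proposed unsupervised statistic.

```python
import numpy as np

def sn_change_point_stat(x):
    """Classical self-normalized change-in-mean statistic:
    sup_k D_n(k)^2 / V_n(k); no long-run variance estimation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    S = np.concatenate(([0.0], np.cumsum(x)))      # S[t] = x_1 + ... + x_t
    best = -np.inf
    for k in range(1, n):                          # candidate break after k
        D = (S[k] - k / n * S[n]) / np.sqrt(n)
        left = sum((S[t] - t / k * S[k]) ** 2 for t in range(1, k + 1))
        right = sum(((S[n] - S[t - 1]) - (n - t + 1) / (n - k) * (S[n] - S[k])) ** 2
                    for t in range(k + 1, n + 1))
        V = (left + right) / n ** 2                # self-normalizer
        if V > 0:
            best = max(best, D * D / V)
    return best
```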
Journal: Journal of the American Statistical Association Pages: 637-648 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2016.1270214 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1270214 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:637-648 Template-Type: ReDIF-Article 1.0 Author-Name: Clara Happ Author-X-Name-First: Clara Author-X-Name-Last: Happ Author-Name: Sonja Greven Author-X-Name-First: Sonja Author-X-Name-Last: Greven Title: Multivariate Functional Principal Component Analysis for Data Observed on Different (Dimensional) Domains Abstract: Existing approaches for multivariate functional principal component analysis are restricted to data on the same one-dimensional interval. The presented approach focuses on multivariate functional data on different domains that may differ in dimension, such as functions and images. The theoretical basis for multivariate functional principal component analysis is given in terms of a Karhunen–Loève Theorem. For the practically relevant case of a finite Karhunen–Loève representation, a relationship between univariate and multivariate functional principal component analysis is established. This offers an estimation strategy to calculate multivariate functional principal components and scores based on their univariate counterparts. For the resulting estimators, asymptotic results are derived. The approach can be extended to finite univariate expansions in general, not necessarily orthonormal, bases. It is also applicable to sparse functional data or data with measurement error. A flexible R implementation is available on CRAN. The new method is shown to be competitive with existing approaches for data observed on a common one-dimensional domain. The motivating application is a neuroimaging study, where the goal is to explore how longitudinal trajectories of a neuropsychological test score covary with FDG-PET brain scans at baseline. Supplementary material, including detailed proofs, additional simulation results, and software, is available online. Journal: Journal of the American Statistical Association Pages: 649-659 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2016.1273115 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1273115 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:649-659 Template-Type: ReDIF-Article 1.0 Author-Name: Alexander Hanbo Li Author-X-Name-First: Alexander Hanbo Author-X-Name-Last: Li Author-Name: Jelena Bradic Author-X-Name-First: Jelena Author-X-Name-Last: Bradic Title: Boosting in the Presence of Outliers: Adaptive Classification With Nonconvex Loss Functions Abstract: This article examines the role and the efficiency of nonconvex loss functions for binary classification problems. In particular, we investigate how to design adaptive and effective boosting algorithms that are robust to the presence of outliers in the data or to the presence of errors in the observed data labels. We demonstrate that nonconvex losses play an important role for prediction accuracy because of the diminishing gradient properties—the ability of the losses to efficiently adapt to the outlying data. We propose a new boosting framework called ArchBoost that uses the diminishing gradient property directly and leads to boosting algorithms that are provably robust.
Along with the ArchBoost framework, a family of nonconvex losses is proposed, which leads to new robust boosting algorithms named adaptive robust boosting (ARB). Furthermore, we develop a new breakdown point analysis and a new influence function analysis that demonstrate gains in robustness. Moreover, based only on local curvatures, we establish statistical and optimization properties of the proposed ArchBoost algorithms with highly nonconvex losses. Extensive numerical and real data examples illustrate theoretical properties and reveal advantages over the existing boosting methods when data are perturbed by an adversary or otherwise. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 660-674 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2016.1273116 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1273116 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:660-674 Template-Type: ReDIF-Article 1.0 Author-Name: Federico Bassetti Author-X-Name-First: Federico Author-X-Name-Last: Bassetti Author-Name: Roberto Casarin Author-X-Name-First: Roberto Author-X-Name-Last: Casarin Author-Name: Francesco Ravazzolo Author-X-Name-First: Francesco Author-X-Name-Last: Ravazzolo Title: Bayesian Nonparametric Calibration and Combination of Predictive Distributions Abstract: We introduce a Bayesian approach to predictive density calibration and combination that accounts for parameter uncertainty and model set incompleteness through the use of random calibration functionals and random combination weights. Building on the work of Ranjan and Gneiting, we use infinite beta mixtures for the calibration. The proposed Bayesian nonparametric approach takes advantage of the flexibility of Dirichlet process mixtures to achieve any continuous deformation of linearly combined predictive distributions. The inference procedure is based on a combination of Gibbs and slice sampling. We provide some conditions under which the proposed probabilistic calibration converges in terms of weak posterior consistency to the true underlying density for both iid and Markovian observations. This calibration property improves upon the earlier calibration approaches. We study the methodology in simulation examples with fat tails and multimodal densities and apply it to density forecasts of daily S&P returns and daily maximum wind speed at the Frankfurt airport. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 675-685 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2016.1273117 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1273117 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:675-685 Template-Type: ReDIF-Article 1.0 Author-Name: L. Fattorini Author-X-Name-First: L. Author-X-Name-Last: Fattorini Author-Name: M. Marcheselli Author-X-Name-First: M. Author-X-Name-Last: Marcheselli Author-Name: L. Pratelli Author-X-Name-First: L. Author-X-Name-Last: Pratelli Title: Design-Based Maps for Finite Populations of Spatial Units Abstract: The estimation of the values of a survey variable in finite populations of spatial units is considered for making maps when samples of spatial units are selected by probabilistic sampling schemes.
The individual values are estimated by means of an inverse distance weighting predictor. The design-based asymptotic properties of the resulting maps, referred to as the design-based maps, are considered when the study area remains fixed and the sizes of the spatial units tend to zero. Conditions ensuring design-based asymptotic unbiasedness and consistency are derived. They essentially require the existence of a pointwise or uniformly continuous density function of the survey variable on the study area, some regularities in the size and shape of the units, and the use of spatially balanced designs to select units. The continuity assumption can be relaxed into a Riemann integrability assumption when estimation is performed at a sufficiently small spatial grain and the estimates are subsequently aggregated at a coarser grain. A computationally simple mean squared error estimator is proposed. A simulation study is performed to assess the theoretical results. An application to estimate the map of wine cultivations in Tuscany (Central Italy) is considered. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 686-697 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2016.1278174 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1278174 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:686-697 Template-Type: ReDIF-Article 1.0 Author-Name: Asaf Weinstein Author-X-Name-First: Asaf Author-X-Name-Last: Weinstein Author-Name: Zhuang Ma Author-X-Name-First: Zhuang Author-X-Name-Last: Ma Author-Name: Lawrence D. Brown Author-X-Name-First: Lawrence D. Author-X-Name-Last: Brown Author-Name: Cun-Hui Zhang Author-X-Name-First: Cun-Hui Author-X-Name-Last: Zhang Title: Group-Linear Empirical Bayes Estimates for a Heteroscedastic Normal Mean Abstract: The problem of estimating the mean of a normal vector with known but unequal variances introduces substantial difficulties that impair the adequacy of traditional empirical Bayes estimators. By taking a different approach that treats the known variances as part of the random observations, we restore symmetry and thus the effectiveness of such methods. We suggest a group-linear empirical Bayes estimator, which collects observations with similar variances and applies a spherically symmetric estimator to each group separately. The proposed estimator is motivated by a new oracle rule which is stronger than the best linear rule, and thus provides a more ambitious benchmark than that considered in the previous literature. Our estimator asymptotically achieves the new oracle risk (under appropriate conditions) and at the same time is minimax. The group-linear estimator is particularly advantageous in situations where the true means and observed variances are empirically dependent. To demonstrate the merits of the proposed methods in real applications, we analyze the baseball data used by Brown (2008), where the group-linear methods achieved the prediction error of the best nonparametric estimates that have been applied to the dataset, and significantly lower error than other parametric and semiparametric empirical Bayes estimators.
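A minimal sketch of the group-linear construction described above: bin the observations by their known variances, then apply a linear shrinkage rule separately within each bin. The quantile binning and the plug-in shrinkage factor below are illustrative assumptions rather than the authors' exact estimator.

```python
import numpy as np

def group_linear_eb(x, v, n_groups=5):
    """Illustrative group-linear empirical Bayes estimate for a
    heteroscedastic normal mean: group by the known variances v,
    then shrink linearly toward the group mean within each group."""
    x, v = np.asarray(x, float), np.asarray(v, float)
    est = np.empty_like(x)
    edges = np.quantile(v, np.linspace(0, 1, n_groups + 1))
    groups = np.clip(np.searchsorted(edges, v, side="right") - 1, 0, n_groups - 1)
    for g in range(n_groups):
        idx = groups == g
        if not idx.any():
            continue
        xg, vg = x[idx], v[idx]
        center, s2 = xg.mean(), xg.var()        # total spread of X in the bin
        shrink = max(0.0, 1.0 - vg.mean() / s2) if s2 > 0 else 0.0
        est[idx] = center + shrink * (xg - center)
    return est
```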
Journal: Journal of the American Statistical Association Pages: 698-710 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1280406 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1280406 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:698-710 Template-Type: ReDIF-Article 1.0 Author-Name: Holger Dette Author-X-Name-First: Holger Author-X-Name-Last: Dette Author-Name: Kathrin Möllenhoff Author-X-Name-First: Kathrin Author-X-Name-Last: Möllenhoff Author-Name: Stanislav Volgushev Author-X-Name-First: Stanislav Author-X-Name-Last: Volgushev Author-Name: Frank Bretz Author-X-Name-First: Frank Author-X-Name-Last: Bretz Title: Equivalence of Regression Curves Abstract: This article investigates whether the difference between two parametric models m1, m2 describing the relation between a response variable and several covariates in two different groups is practically irrelevant, such that inference can be performed on the basis of the pooled sample. Statistical methodology is developed to test the hypotheses H0: d(m1, m2) ⩾ ϵ versus H1: d(m1, m2) < ϵ to demonstrate equivalence between the two regression curves m1, m2 for a prespecified threshold ϵ, where d denotes a metric measuring the distance between m1 and m2. Our approach is based on the asymptotic properties of a suitable estimator $d(\hat{m}_1, \hat{m}_2)$ of this distance. To improve the approximation of the nominal level for small sample sizes, a bootstrap test is developed, which addresses the specific form of the interval hypotheses. In particular, data have to be generated under the null hypothesis, which implicitly defines a manifold for the parameter vector. The results are illustrated by means of a simulation study and a data example. It is demonstrated that the new methods substantially improve currently available approaches with respect to power and approximation of the nominal level. Journal: Journal of the American Statistical Association Pages: 711-729 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1281813 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1281813 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:711-729 Template-Type: ReDIF-Article 1.0 Author-Name: Chong Zhang Author-X-Name-First: Chong Author-X-Name-Last: Zhang Author-Name: Wenbo Wang Author-X-Name-First: Wenbo Author-X-Name-Last: Wang Author-Name: Xingye Qiao Author-X-Name-First: Xingye Author-X-Name-Last: Qiao Title: On Reject and Refine Options in Multicategory Classification Abstract: In many real applications of statistical learning, a decision made from misclassification can be too costly to afford; in this case, a reject option, which defers the decision until further investigation is conducted, is often preferred. In recent years, there has been much development for binary classification with a reject option. Yet, little progress has been made for the multicategory case. In this article, we propose margin-based multicategory classification methods with a reject option. In addition, and more importantly, we introduce a new and unique refine option for the multicategory problem, where the class of an observation is predicted to be from a set of class labels, whose cardinality is not necessarily one. The main advantage of both options lies in their capacity of identifying error-prone observations.
Moreover, the refine option can provide more constructive information for classification by effectively ruling out implausible classes. Efficient implementations have been developed for the proposed methods. On the theoretical side, we offer a novel statistical learning theory and show a fast convergence rate of the excess ℓ-risk of our methods with emphasis on diverging dimensionality and number of classes. The results can be further improved under a low noise assumption and be generalized to the excess 0-d-1 risk. Finite-sample upper bounds for the reject and reject/refine rates are also provided. A set of comprehensive simulation and real data studies has shown the usefulness of the new learning tools compared to regular multicategory classifiers. Detailed proofs of theorems and extended numerical results are included in the supplemental materials available online. Journal: Journal of the American Statistical Association Pages: 730-745 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1282372 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1282372 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:730-745 Template-Type: ReDIF-Article 1.0 Author-Name: Kejun He Author-X-Name-First: Kejun Author-X-Name-Last: He Author-Name: Heng Lian Author-X-Name-First: Heng Author-X-Name-Last: Lian Author-Name: Shujie Ma Author-X-Name-First: Shujie Author-X-Name-Last: Ma Author-Name: Jianhua Z. Huang Author-X-Name-First: Jianhua Z. Author-X-Name-Last: Huang Title: Dimensionality Reduction and Variable Selection in Multivariate Varying-Coefficient Models With a Large Number of Covariates Abstract: Motivated by the study of gene and environment interactions, we consider a multivariate response varying-coefficient model with a large number of covariates. The need to nonparametrically estimate a large number of coefficient functions given relatively limited data poses a big challenge for fitting such a model. To overcome the challenge, we develop a method that incorporates three ideas: (i) reduce the number of unknown functions to be estimated by using (noncentered) principal components; (ii) approximate the unknown functions by polynomial splines; (iii) apply sparsity-inducing penalization to select relevant covariates. The three ideas are integrated into a penalized least-square framework. Our asymptotic theory shows that the proposed method can consistently identify relevant covariates and can estimate the corresponding coefficient functions with the same convergence rate as when only the relevant variables are included in the model. We also develop a novel computational algorithm to solve the penalized least-square problem by combining proximal algorithms and optimization over Stiefel manifolds. Our method is illustrated using data from the Framingham Heart Study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 746-754 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1285774 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1285774 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:746-754 Template-Type: ReDIF-Article 1.0 Author-Name: Forrest W. Crawford Author-X-Name-First: Forrest W.
Author-X-Name-Last: Crawford Author-Name: Jiacheng Wu Author-X-Name-First: Jiacheng Author-X-Name-Last: Wu Author-Name: Robert Heimer Author-X-Name-First: Robert Author-X-Name-Last: Heimer Title: Hidden Population Size Estimation From Respondent-Driven Sampling: A Network Approach Abstract: Estimating the size of stigmatized, hidden, or hard-to-reach populations is a major problem in epidemiology, demography, and public health research. Capture–recapture and multiplier methods are standard tools for inference of hidden population sizes, but they require random sampling of target population members, which is rarely possible. Respondent-driven sampling (RDS) is a survey method for hidden populations that relies on social link tracing. The RDS recruitment process is designed to spread through the social network connecting members of the target population. In this article, we show how to use network data revealed by RDS to estimate hidden population size. The key insight is that the recruitment chain, timing of recruitments, and network degrees of recruited subjects provide information about the number of individuals belonging to the target population who are not yet in the sample. We use a computationally efficient Bayesian method to integrate over the missing edges in the subgraph of recruited individuals. We validate the method using simulated data and apply the technique to estimate the number of people who inject drugs in St. Petersburg, Russia. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 755-766 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1285775 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1285775 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:755-766 Template-Type: ReDIF-Article 1.0 Author-Name: Sebastian Calonico Author-X-Name-First: Sebastian Author-X-Name-Last: Calonico Author-Name: Matias D. Cattaneo Author-X-Name-First: Matias D. Author-X-Name-Last: Cattaneo Author-Name: Max H. Farrell Author-X-Name-First: Max H. Author-X-Name-Last: Farrell Title: On the Effect of Bias Estimation on Coverage Accuracy in Nonparametric Inference Abstract: Nonparametric methods play a central role in modern empirical work. While they provide inference procedures that are more robust to parametric misspecification bias, they may be quite sensitive to tuning parameter choices. We study the effects of bias correction on confidence interval coverage in the context of kernel density and local polynomial regression estimation, and prove that bias correction can be preferred to undersmoothing for minimizing coverage error and increasing robustness to tuning parameter choice. This is achieved using a novel, yet simple, Studentization, which leads to a new way of constructing kernel-based bias-corrected confidence intervals. In addition, for practical cases, we derive coverage error optimal bandwidths and discuss easy-to-implement bandwidth selectors. 
For interior points, we show that the mean-squared error (MSE)-optimal bandwidth for the original point estimator (before bias correction) delivers the fastest coverage error decay rate after bias correction when second-order (equivalent) kernels are employed, but is otherwise suboptimal because it is too “large.” Finally, for odd-degree local polynomial regression, we show that, as with point estimation, coverage error adapts to boundary points automatically when appropriate Studentization is used; however, the MSE-optimal bandwidth for the original point estimator is suboptimal. All the results are established using valid Edgeworth expansions and illustrated with simulated data. Our findings have important consequences for empirical work as they indicate that bias-corrected confidence intervals, coupled with appropriate standard errors, have smaller coverage error and are less sensitive to tuning parameter choices in practically relevant cases where additional smoothness is available. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 767-779 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1285776 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1285776 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:767-779 Template-Type: ReDIF-Article 1.0 Author-Name: Alexander R. Luedtke Author-X-Name-First: Alexander R. Author-X-Name-Last: Luedtke Author-Name: Mark J. van der Laan Author-X-Name-First: Mark J. van der Author-X-Name-Last: Laan Title: Parametric-Rate Inference for One-Sided Differentiable Parameters Abstract: Suppose one has a collection of parameters indexed by a (possibly infinite dimensional) set. Given data generated from some distribution, the objective is to estimate the maximal parameter in this collection evaluated at the distribution that generated the data. This estimation problem is typically nonregular when the maximizing parameter is nonunique, and as a result standard asymptotic techniques generally fail in this case. We present a technique for developing parametric-rate confidence intervals for the quantity of interest in these nonregular settings. We show that our estimator is asymptotically efficient when the maximizing parameter is unique so that regular estimation is possible. We apply our technique to a recent example from the literature in which one wishes to report the maximal absolute correlation between a prespecified outcome and one of p predictors. The simplicity of our technique enables an analysis of the previously open case where p grows with sample size. Specifically, we only require that log p grows slower than $\sqrt{n}$, where n is the sample size. We show that, unlike earlier approaches, our method scales to massive datasets: the point estimate and confidence intervals can be constructed in O(np) time. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 780-788 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1285777 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1285777 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:780-788 Template-Type: ReDIF-Article 1.0 Author-Name: Ery Arias-Castro Author-X-Name-First: Ery Author-X-Name-Last: Arias-Castro Author-Name: Rui M.
Castro Author-X-Name-First: Rui M. Author-X-Name-Last: Castro Author-Name: Ervin Tánczos Author-X-Name-First: Ervin Author-X-Name-Last: Tánczos Author-Name: Meng Wang Author-X-Name-First: Meng Author-X-Name-Last: Wang Title: Distribution-Free Detection of Structured Anomalies: Permutation and Rank-Based Scans Abstract: The scan statistic is by far the most popular method for anomaly detection, with applications in syndromic surveillance, signal and image processing, and target detection based on sensor networks, among others. The use of the scan statistic in such settings yields a hypothesis testing procedure, where the null hypothesis corresponds to the absence of anomalous behavior. If the null distribution is known, then calibration of a scan-based test is relatively easy, as it can be done by Monte Carlo simulation. When the null distribution is unknown, it is less straightforward. We investigate two procedures. The first is calibration by permutation; the other is a rank-based scan test, which is distribution-free and less sensitive to outliers. Furthermore, the rank scan test requires only a one-time calibration for a given data size, making it computationally much more appealing. In both cases, we quantify the performance loss with respect to an oracle scan test that knows the null distribution. We show that using one of these calibration procedures results in only a very small loss of power in the context of a natural exponential family. This includes the classical normal location model, popular in signal processing, and the Poisson model, popular in syndromic surveillance. We perform numerical experiments on simulated data further supporting our theory and also on a real dataset from genomics. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 789-801 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1286240 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1286240 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:789-801 Template-Type: ReDIF-Article 1.0 Author-Name: Li Ma Author-X-Name-First: Li Author-X-Name-Last: Ma Author-Name: Jacopo Soriano Author-X-Name-First: Jacopo Author-X-Name-Last: Soriano Title: Efficient Functional ANOVA Through Wavelet-Domain Markov Groves Abstract: We introduce a wavelet-domain method for functional analysis of variance (fANOVA). It is based on a Bayesian hierarchical model that employs a graphical hyperprior in the form of a Markov grove (MG)—that is, a collection of Markov trees—for linking the presence/absence of factor effects at all location-scale combinations, thereby incorporating the natural clustering of factor effects in the wavelet-domain across locations and scales. Inference under the model enjoys both analytical simplicity and computational efficiency. Specifically, the posterior of the full hierarchical model is available in closed form through a pyramid algorithm operationally similar to Mallat’s pyramid algorithm for discrete wavelet transform (DWT), achieving for exact Bayesian inference the same computational efficiency—linear in both the number of observations and the number of locations—as for carrying out the DWT.
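The calibration by permutation described in the scan-statistic abstract above is easy to sketch for the simplest one-dimensional scan with a fixed window length; the window statistic and Monte Carlo p-value below are illustrative choices (the rank-based variant would simply replace x by its ranks before scanning).

```python
import numpy as np

def scan_stat(x, w):
    """Simple 1-d scan statistic: the maximum sum over windows of length w."""
    return np.convolve(x, np.ones(w), mode="valid").max()

def permutation_scan_pvalue(x, w, n_perm=999, seed=None):
    """Calibrate the scan test by permutation when the null distribution
    is unknown: permuting the sequence destroys any localized anomaly
    while preserving the marginal distribution of the observations."""
    rng = np.random.default_rng(seed)
    observed = scan_stat(x, w)
    exceed = sum(scan_stat(rng.permutation(x), w) >= observed
                 for _ in range(n_perm))
    return (1 + exceed) / (1 + n_perm)
```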
In particular, posterior probabilities of the presence of factor contributions to functional variation are directly available from the pyramid algorithm, while posterior samples for the factor effects can be drawn directly from the exact posterior through standard (not Markov chain) Monte Carlo. We investigate the performance of our method through extensive simulation and show that it substantially outperforms existing wavelet-domain fANOVA methods in a variety of common settings. We illustrate the method through analyzing the orthosis data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 802-818 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1286241 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1286241 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:802-818 Template-Type: ReDIF-Article 1.0 Author-Name: Joshua Chan Author-X-Name-First: Joshua Author-X-Name-Last: Chan Author-Name: Roberto Leon-Gonzalez Author-X-Name-First: Roberto Author-X-Name-Last: Leon-Gonzalez Author-Name: Rodney W. Strachan Author-X-Name-First: Rodney W. Author-X-Name-Last: Strachan Title: Invariant Inference and Efficient Computation in the Static Factor Model Abstract: Factor models are used in a wide range of areas. Two issues with Bayesian versions of these models are a lack of invariance to the ordering and scaling of the variables, and computational inefficiency. This article develops invariant and efficient Bayesian methods for estimating static factor models. This approach leads to inference that does not depend upon the ordering or scaling of the variables, and we provide arguments to explain this invariance. Beginning from identified parameters which are subject to orthogonality restrictions, we use parameter expansions to obtain a specification with computationally convenient conditional posteriors. We show significant gains in computational efficiency. Identifying restrictions that are commonly employed result in interpretable factors or loadings and, using our approach, these can be imposed ex-post. This allows us to investigate several alternative identifying (noninvariant) schemes without the need to respecify and resample the model. We illustrate the methods with two macroeconomic datasets. Journal: Journal of the American Statistical Association Pages: 819-828 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1287080 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1287080 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:819-828 Template-Type: ReDIF-Article 1.0 Author-Name: HaiYing Wang Author-X-Name-First: HaiYing Author-X-Name-Last: Wang Author-Name: Rong Zhu Author-X-Name-First: Rong Author-X-Name-Last: Zhu Author-Name: Ping Ma Author-X-Name-First: Ping Author-X-Name-Last: Ma Title: Optimal Subsampling for Large Sample Logistic Regression Abstract: For massive data, the family of subsampling algorithms is popular for downsizing the data volume and reducing the computational burden. Existing studies focus on approximating the ordinary least-square estimate in linear regression, where statistical leverage scores are often used to define subsampling probabilities. In this article, we propose fast subsampling algorithms to efficiently approximate the maximum likelihood estimate in logistic regression.
We first establish consistency and asymptotic normality of the estimator from a general subsampling algorithm, and then derive optimal subsampling probabilities that minimize the asymptotic mean squared error of the resultant estimator. An alternative minimization criterion is also proposed to further reduce the computational cost. The optimal subsampling probabilities depend on the full data estimate, so we develop a two-step algorithm to approximate the optimal subsampling procedure. This algorithm is computationally efficient and achieves a significant reduction in computing time compared to the full-data approach. Consistency and asymptotic normality of the estimator from a two-step algorithm are also established. Synthetic and real datasets are used to evaluate the practical performance of the proposed method. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 829-844 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1292914 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1292914 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:829-844 Template-Type: ReDIF-Article 1.0 Author-Name: Dungang Liu Author-X-Name-First: Dungang Author-X-Name-Last: Liu Author-Name: Heping Zhang Author-X-Name-First: Heping Author-X-Name-Last: Zhang Title: Residuals and Diagnostics for Ordinal Regression Models: A Surrogate Approach Abstract: Ordinal outcomes are common in scientific research and everyday practice, and we often rely on regression models to make inference. A long-standing problem with such regression analyses is the lack of effective diagnostic tools for validating model assumptions. The difficulty arises from the fact that an ordinal variable has discrete values that are labeled with, but are not themselves, numerical values. The values merely represent ordered categories. In this article, we propose a surrogate approach to defining residuals for an ordinal outcome Y. The idea is to define a continuous variable S as a “surrogate” of Y and then obtain residuals based on S. For the general class of cumulative link regression models, we study the residual’s theoretical and graphical properties. We show that the residual has null properties similar to those of the common residuals for continuous outcomes. Our numerical studies demonstrate that the residual has power to detect misspecification with respect to (1) mean structures; (2) link functions; (3) heteroscedasticity; (4) proportionality; and (5) mixed populations. The proposed residual also enables us to develop numeric measures for goodness of fit using classical distance notions. Our results suggest that compared to a previously defined residual, our residual can reveal deeper insights into model diagnostics. We stress that this work focuses on residual analysis, rather than hypothesis testing. The latter has limited utility as it only provides a single p-value, whereas our residual can reveal what components of the model are misspecified and indicate how to make improvements. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 845-854 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1292915 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1292915 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
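For the surrogate approach above, a minimal sketch in the ordered-probit case: given fitted linear predictors eta and cutpoints alpha, the surrogate S is drawn from the latent N(eta, 1) distribution truncated to the interval implied by the observed category, and the residual is R = S - E(S | X) = S - eta. The category coding (1..K) and the probit link are assumptions for illustration.

```python
import numpy as np
from scipy.stats import truncnorm

def surrogate_residuals(eta, y, cutpoints, seed=None):
    """Surrogate residuals for an ordered-probit fit: sample the latent
    surrogate S ~ N(eta, 1) truncated to (alpha_{y-1}, alpha_y], then
    take R = S - eta, which is N(0, 1) under a correctly specified model."""
    rng = np.random.default_rng(seed)
    eta, y = np.asarray(eta, float), np.asarray(y, int)   # y coded 1..K
    alpha = np.concatenate(([-np.inf], np.asarray(cutpoints, float), [np.inf]))
    lo, hi = alpha[y - 1] - eta, alpha[y] - eta           # standardized bounds
    s = truncnorm.rvs(lo, hi, loc=eta, scale=1.0, random_state=rng)
    return s - eta

# Plotting the residuals against eta or a covariate should show no
# systematic pattern when the model is correctly specified.
```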
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:845-854 Template-Type: ReDIF-Article 1.0 Author-Name: Alexandre Bouchard-Côté Author-X-Name-First: Alexandre Author-X-Name-Last: Bouchard-Côté Author-Name: Sebastian J. Vollmer Author-X-Name-First: Sebastian J. Author-X-Name-Last: Vollmer Author-Name: Arnaud Doucet Author-X-Name-First: Arnaud Author-X-Name-Last: Doucet Title: The Bouncy Particle Sampler: A Nonreversible Rejection-Free Markov Chain Monte Carlo Method Abstract: Many Markov chain Monte Carlo techniques currently available rely on discrete-time reversible Markov processes whose transition kernels are variations of the Metropolis–Hastings algorithm. We explore and generalize an alternative scheme recently introduced in the physics literature (Peters and de With 2012) where the target distribution is explored using a continuous-time nonreversible piecewise-deterministic Markov process. In the Metropolis–Hastings algorithm, a trial move to a region of lower target density, equivalently of higher “energy,” than the current state can be rejected with positive probability. In this alternative approach, a particle moves along straight lines around the space and, when facing a high energy barrier, it is not rejected, but its path is modified by bouncing against this barrier. By reformulating this algorithm using inhomogeneous Poisson processes, we exploit standard sampling techniques to simulate exactly this Markov process in a wide range of scenarios of interest. Additionally, when the target distribution is given by a product of factors dependent only on subsets of the state variables, such as the posterior distribution associated with a probabilistic graphical model, this method can be modified to take advantage of this structure by allowing computationally cheaper “local” bounces, which only involve the state variables associated with a factor, while the other state variables keep on evolving. In this context, by leveraging techniques from chemical kinetics, we propose several computationally efficient implementations. Experimentally, this new class of Markov chain Monte Carlo schemes compares favorably to state-of-the-art methods on various Bayesian inference tasks, including for high-dimensional models and large datasets. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 855-867 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1294075 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1294075 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:855-867 Template-Type: ReDIF-Article 1.0 Author-Name: Rahul Mukerjee Author-X-Name-First: Rahul Author-X-Name-Last: Mukerjee Author-Name: Tirthankar Dasgupta Author-X-Name-First: Tirthankar Author-X-Name-Last: Dasgupta Author-Name: Donald B. Rubin Author-X-Name-First: Donald B. Author-X-Name-Last: Rubin Title: Using Standard Tools From Finite Population Sampling to Improve Causal Inference for Complex Experiments Abstract: This article considers causal inference for treatment contrasts from a randomized experiment using potential outcomes in a finite population setting. Adopting a Neymanian repeated sampling approach that integrates such causal inference with finite population survey sampling, we develop an inferential framework for general mechanisms of assigning experimental units to multiple treatments.
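The bounce mechanism of the sampler above has a closed form: at an event, the velocity reflects off the energy gradient, v becomes v - 2 (v . grad U / ||grad U||^2) grad U. For an isotropic Gaussian target, U(x) = ||x||^2 / 2, the event rate max(0, <v, grad U(x + t v)>) is piecewise linear in t and can be inverted exactly, which the sketch below exploits; the target and the refreshment rate are illustrative choices, not the article's general factor-based implementation.

```python
import numpy as np

def bps_gaussian(d=2, T=50.0, lam_ref=1.0, seed=None):
    """Bouncy Particle Sampler for a standard Gaussian target,
    U(x) = ||x||^2 / 2, so grad U(x) = x. Bounce times are drawn by
    exact inversion of the piecewise-linear rate max(0, a + b s);
    velocity refreshments keep the process ergodic."""
    rng = np.random.default_rng(seed)
    x, v, t, events = rng.normal(size=d), rng.normal(size=d), 0.0, []
    while t < T:
        a, b = x @ v, v @ v                        # rate(s) = max(0, a + b s)
        e = rng.exponential()
        tau = (-a + np.sqrt(a * a + 2 * b * e)) / b if a >= 0 \
              else -a / b + np.sqrt(2 * e / b)     # exact bounce time
        tau_ref = rng.exponential(1.0 / lam_ref)   # refreshment time
        dt = min(tau, tau_ref)
        x, t = x + dt * v, t + dt                  # straight-line motion
        events.append(x.copy())
        if tau_ref < tau:
            v = rng.normal(size=d)                 # refresh the velocity
        else:
            g = x                                  # grad U at the event point
            v = v - 2 * (v @ g) / (g @ g) * g      # bounce: reflect off barrier
    return np.array(events)  # event locations; ergodic averages use the segments
```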
This framework extends classical methods by allowing the possibility of randomization restrictions and unequal replications. Novel conditions that are “milder” than strict additivity of treatment effects, yet permit unbiased estimation of the finite population sampling variance of any treatment contrast estimator, are derived. The consequences of departures from such conditions are also studied under the criterion of minimax bias, and a new justification for using the Neymanian conservative sampling variance estimator in experiments is provided. The proposed approach can readily be extended to the case of treatments with a general factorial structure. Journal: Journal of the American Statistical Association Pages: 868-881 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1294076 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1294076 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:868-881 Template-Type: ReDIF-Article 1.0 Author-Name: Dandan Liu Author-X-Name-First: Dandan Author-X-Name-Last: Liu Author-Name: Tianxi Cai Author-X-Name-First: Tianxi Author-X-Name-Last: Cai Author-Name: Anna Lok Author-X-Name-First: Anna Author-X-Name-Last: Lok Author-Name: Yingye Zheng Author-X-Name-First: Yingye Author-X-Name-Last: Zheng Title: Nonparametric Maximum Likelihood Estimators of Time-Dependent Accuracy Measures for Survival Outcome Under Two-Stage Sampling Designs Abstract: Large prospective cohort studies of rare chronic diseases require thoughtful planning of study designs, especially for biomarker studies when measurements are based on stored tissue or blood specimens. Two-phase designs, including nested case–control and case-cohort sampling designs, provide cost-effective strategies for conducting biomarker evaluation studies. Existing literature for biomarker assessment under two-phase designs largely focuses on simple inverse probability weighting (IPW) estimators. Drawing on recent theoretical developments on the maximum likelihood estimators for relative risk parameters in two-phase studies, we propose nonparametric maximum likelihood-based estimators to evaluate the accuracy and predictiveness of a risk prediction biomarker under both types of two-phase designs. In addition, hybrid estimators that combine IPW estimators and maximum likelihood estimation procedure are proposed to improve efficiency and alleviate computational burden. We derive large sample properties of proposed estimators and evaluate their finite sample performance using numerical studies. We illustrate the new procedures using a two-phase biomarker study aiming to evaluate the accuracy of a novel biomarker, des-γ-carboxy prothrombin, for early detection of hepatocellular carcinoma. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 882-892 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1295866 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1295866 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:882-892 Template-Type: ReDIF-Article 1.0 Author-Name: Kin Yau Wong Author-X-Name-First: Kin Yau Author-X-Name-Last: Wong Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: D. Y. Lin Author-X-Name-First: D. Y.
Author-X-Name-Last: Lin Title: Efficient Estimation for Semiparametric Structural Equation Models With Censored Data Abstract: Structural equation modeling is commonly used to capture complex structures of relationships among multiple variables, both latent and observed. We propose a general class of structural equation models with a semiparametric component for potentially censored survival times. We consider nonparametric maximum likelihood estimation and devise a combined expectation-maximization and Newton-Raphson algorithm for its implementation. We establish conditions for model identifiability and prove the consistency, asymptotic normality, and semiparametric efficiency of the estimators. Finally, we demonstrate the satisfactory performance of the proposed methods through simulation studies and provide an application to a motivating cancer study that contains a variety of genomic variables. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 893-905 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1299626 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1299626 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:893-905 Template-Type: ReDIF-Article 1.0 Author-Name: Ori Davidov Author-X-Name-First: Ori Author-X-Name-Last: Davidov Author-Name: Casey M. Jelsema Author-X-Name-First: Casey M. Author-X-Name-Last: Jelsema Author-Name: Shyamal Peddada Author-X-Name-First: Shyamal Author-X-Name-Last: Peddada Title: Testing for Inequality Constraints in Singular Models by Trimming or Winsorizing the Variance Matrix Abstract: There are many applications in which a statistic follows, at least asymptotically, a normal distribution with a singular or nearly singular variance matrix. A classic example occurs in linear regression models under multicollinearity but there are many more such examples. There is well-developed theory for testing linear equality constraints when the alternative is two-sided and the variance matrix is either singular or nonsingular. In recent years, there has been considerable, and growing, interest in developing methods for situations in which the estimated variance matrix is nearly singular. However, there is no corresponding methodology for addressing one-sided, that is, constrained or ordered alternatives. In this article, we develop a unified framework for analyzing such problems. Our approach may be viewed as the trimming or winsorizing of the eigenvalues of the corresponding variance matrix. The proposed methodology is applicable to a wide range of scientific problems and to a variety of statistical models in which inequality constraints arise. We illustrate the methodology using data from a gene expression microarray experiment obtained from the NIEHS’ Fibroid Growth Study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 906-918 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1301258 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1301258 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
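[Illustrative note] The trimming-or-winsorizing idea in the Davidov, Jelsema, and Peddada abstract above acts directly on the spectrum of the (nearly) singular variance matrix. A minimal sketch of the two spectral repairs, assuming a symmetric positive semidefinite input; the relative cutoff rule is an illustrative choice of ours, not the paper's.

import numpy as np

def winsorize_cov(S, rel_tol=1e-6):
    """Winsorize: lift eigenvalues below rel_tol * (largest eigenvalue)."""
    vals, vecs = np.linalg.eigh(S)           # S assumed symmetric PSD
    floor = rel_tol * vals.max()
    vals_w = np.clip(vals, floor, None)      # tiny eigenvalues raised to floor
    return (vecs * vals_w) @ vecs.T

def trim_cov(S, rel_tol=1e-6):
    """Trim: drop eigendirections whose eigenvalues fall below the floor."""
    vals, vecs = np.linalg.eigh(S)
    keep = vals > rel_tol * vals.max()
    return (vecs[:, keep] * vals[keep]) @ vecs[:, keep].T

Winsorizing keeps the full dimension but regularizes it; trimming restricts inference to the well-identified subspace. Which repair is appropriate under one-sided alternatives is exactly the question the article studies.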
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:906-918 Template-Type: ReDIF-Article 1.0 Author-Name: Jia Chen Author-X-Name-First: Jia Author-X-Name-Last: Chen Author-Name: Degui Li Author-X-Name-First: Degui Author-X-Name-Last: Li Author-Name: Oliver Linton Author-X-Name-First: Oliver Author-X-Name-Last: Linton Author-Name: Zudi Lu Author-X-Name-First: Zudi Author-X-Name-Last: Lu Title: Semiparametric Ultra-High Dimensional Model Averaging of Nonlinear Dynamic Time Series Abstract: We propose two semiparametric model averaging schemes for nonlinear dynamic time series regression models with a very large number of covariates including exogenous regressors and auto-regressive lags. Our objective is to obtain more accurate estimates and forecasts of time series by using a large number of conditioning variables in a nonparametric way. In the first scheme, we introduce a kernel sure independence screening (KSIS) technique to screen out the regressors whose marginal regression (or autoregression) functions do not make a significant contribution to estimating the joint multivariate regression function; we then propose a semiparametric penalized method of model averaging marginal regression (MAMAR) for the regressors and auto-regressors that survive the screening procedure, to further select the regressors that have significant effects on estimating the multivariate regression function and predicting the future values of the response variable. In the second scheme, we impose an approximate factor modeling structure on the ultra-high dimensional exogenous regressors and use principal component analysis to estimate the latent common factors; we then apply the penalized MAMAR method to select the estimated common factors and the lags of the response variable that are significant. In each of the two schemes, we construct the optimal combination of the significant marginal regression and autoregression functions. Asymptotic properties for these two schemes are derived under some regularity conditions. Numerical studies including both simulation and an empirical application to forecasting inflation are given to illustrate the proposed methodology. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 919-932 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1302339 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1302339 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:919-932 Template-Type: ReDIF-Article 1.0 Author-Name: Peter Ganong Author-X-Name-First: Peter Author-X-Name-Last: Ganong Author-Name: Simon Jäger Author-X-Name-First: Simon Author-X-Name-Last: Jäger Title: A Permutation Test for the Regression Kink Design Abstract: The regression kink (RK) design is an increasingly popular empirical method for estimating causal effects of policies, such as the effect of unemployment benefits on unemployment duration. Using simulation studies based on data from existing RK designs, we empirically document that the statistical significance of RK estimators based on conventional standard errors can be spurious. In the simulations, false positives arise as a consequence of nonlinearities in the underlying relationship between the outcome and the assignment variable, confirming concerns about the misspecification bias of discontinuity estimators pointed out by Calonico, Cattaneo, and Titiunik.
As a complement to standard RK inference, we propose that researchers construct a distribution of placebo estimates in regions with and without a policy kink and use this distribution to gauge statistical significance. Under the assumption that the location of the kink point is random, this permutation test has exact size in finite samples for testing a sharp null hypothesis of no effect of the policy on the outcome. We implement simulation studies based on existing RK applications that estimate the effect of unemployment benefits on unemployment duration and show that our permutation test as well as inference procedures proposed by Calonico, Cattaneo, and Titiunik improve upon the size of standard approaches, while having sufficient power to detect an effect of unemployment benefits on unemployment duration. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 494-504 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1328356 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1328356 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:494-504 Template-Type: ReDIF-Article 1.0 Author-Name: Zhigang Yao Author-X-Name-First: Zhigang Author-X-Name-Last: Yao Author-Name: Ye Zhang Author-X-Name-First: Ye Author-X-Name-Last: Zhang Author-Name: Zhidong Bai Author-X-Name-First: Zhidong Author-X-Name-Last: Bai Author-Name: William F. Eddy Author-X-Name-First: William F. Author-X-Name-Last: Eddy Title: Estimating the Number of Sources in Magnetoencephalography Using Spiked Population Eigenvalues Abstract: Magnetoencephalography (MEG) is an advanced imaging technique used to measure the magnetic fields outside the human head produced by the electrical activity inside the brain. Various source localization methods in MEG require the knowledge of the underlying active sources, which are identified a priori. Common methods used to estimate the number of sources include principal component analysis and information criterion methods, both of which make use of the eigenvalue distribution of the data, thus avoiding solving the time-consuming inverse problem. Unfortunately, all these methods are very sensitive to the signal-to-noise ratio (SNR), as examining the sample extreme eigenvalues does not necessarily reflect the perturbation of the population ones. To uncover the unknown sources from the very noisy MEG data, we introduce a framework, referred to as the intrinsic dimensionality (ID) of the optimal transformation for the SNR rescaling functional. It is defined as the number of the spiked population eigenvalues of the associated transformed data matrix. It is shown that the ID yields a more reasonable estimate for the number of sources than its sample counterparts, especially when the SNR is small. By means of examples, we illustrate that the new method is able to capture the number of signal sources in MEG that can escape PCA or other information criterion-based methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 505-518 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1341411 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1341411 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
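[Illustrative note] The placebo-estimate construction in the Ganong and Jäger abstract above can be sketched directly. The toy version below uses a simple two-sided local OLS kink estimator and ranks the estimate at the true kink against estimates at placebo kink locations; the estimator, bandwidth handling, and names are our simplifications, not the authors' procedure.

import numpy as np

def kink_estimate(x, y, k, h):
    """Slope change at candidate kink k: separate OLS lines on [k-h, k) and [k, k+h]."""
    slopes = []
    for mask in ((x >= k - h) & (x < k), (x >= k) & (x <= k + h)):
        X = np.column_stack([np.ones(mask.sum()), x[mask] - k])
        beta, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
        slopes.append(beta[1])
    return slopes[1] - slopes[0]          # estimated change in slope at k

def rk_permutation_pvalue(x, y, k_true, placebo_ks, h):
    """Rank the true-kink estimate within the placebo-kink distribution."""
    est = kink_estimate(x, y, k_true, h)
    placebo = np.array([kink_estimate(x, y, k, h) for k in placebo_ks])
    return (1 + np.sum(np.abs(placebo) >= np.abs(est))) / (1 + len(placebo))

Under the abstract's assumption that the kink location is random, comparing against placebo locations rather than against a normal reference is what delivers exact finite-sample size.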
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:505-518 Template-Type: ReDIF-Article 1.0 Author-Name: Xi Chen Author-X-Name-First: Xi Author-X-Name-Last: Chen Author-Name: Kaoru Irie Author-X-Name-First: Kaoru Author-X-Name-Last: Irie Author-Name: David Banks Author-X-Name-First: David Author-X-Name-Last: Banks Author-Name: Robert Haslinger Author-X-Name-First: Robert Author-X-Name-Last: Haslinger Author-Name: Jewell Thomas Author-X-Name-First: Jewell Author-X-Name-Last: Thomas Author-Name: Mike West Author-X-Name-First: Mike Author-X-Name-Last: West Title: Scalable Bayesian Modeling, Monitoring, and Analysis of Dynamic Network Flow Data Abstract: Traffic flow count data in networks arise in many applications, such as automobile or aviation transportation, certain directed social network contexts, and Internet studies. Using an example of Internet browser traffic flow through site-segments of an international news website, we present Bayesian analyses of two linked classes of models which, in tandem, allow fast, scalable, and interpretable Bayesian inference. We first develop flexible state-space models for streaming count data, able to adaptively characterize and quantify network dynamics efficiently in real-time. We then use these models as emulators of more structured, time-varying gravity models that allow formal dissection of network dynamics. This yields interpretable inferences on traffic flow characteristics, and on dynamics in interactions among network nodes. Bayesian monitoring theory defines a strategy for sequential model assessment and adaptation in cases when network flow data deviate from model-based predictions. Exploratory and sequential monitoring analyses of evolving traffic on a network of web site-segments in e-commerce demonstrate the utility of this coupled Bayesian emulation approach to analysis of streaming network count data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 519-533 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1345742 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1345742 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:519-533 Template-Type: ReDIF-Article 1.0 Author-Name: Bruce J. Swihart Author-X-Name-First: Bruce J. Author-X-Name-Last: Swihart Author-Name: Michael P. Fay Author-X-Name-First: Michael P. Author-X-Name-Last: Fay Author-Name: Kazutoyo Miura Author-X-Name-First: Kazutoyo Author-X-Name-Last: Miura Title: Statistical Methods for Standard Membrane-Feeding Assays to Measure Transmission Blocking or Reducing Activity in Malaria Abstract: Transmission blocking vaccines for malaria are not designed to directly protect vaccinated people from malaria disease, but to reduce the probability of infecting other people by interfering with the growth of the malaria parasite in mosquitoes. Standard membrane-feeding assays compare the growth of parasites in mosquitoes fed a test sample (using antibodies from a vaccinated person) with that in mosquitoes fed a control sample. There is debate about whether to estimate the transmission reducing activity (TRA) which compares the mean number of parasites between test and control samples, or transmission blocking activity (TBA) which compares the proportion of infected mosquitoes.
TBA appears biologically more important since each mosquito with any parasites is potentially infective; however, TBA is less reproducible and may be an overly strict criterion for screening vaccine candidates. Through a statistical model, we show that the TBA estimand depends on μc, the mean number of parasites in the control mosquitoes, a parameter not easily experimentally controlled. We develop a standardized TBA estimator based on the model and a given target value for μc which has better mean squared error than alternative methods. We discuss types of statistical inference needed for using these assays for vaccine development. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 534-545 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1356313 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356313 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:534-545 Template-Type: ReDIF-Article 1.0 Author-Name: Kun Chen Author-X-Name-First: Kun Author-X-Name-Last: Chen Author-Name: Neha Mishra Author-X-Name-First: Neha Author-X-Name-Last: Mishra Author-Name: Joan Smyth Author-X-Name-First: Joan Author-X-Name-Last: Smyth Author-Name: Haim Bar Author-X-Name-First: Haim Author-X-Name-Last: Bar Author-Name: Elizabeth Schifano Author-X-Name-First: Elizabeth Author-X-Name-Last: Schifano Author-Name: Lynn Kuo Author-X-Name-First: Lynn Author-X-Name-Last: Kuo Author-Name: Ming-Hui Chen Author-X-Name-First: Ming-Hui Author-X-Name-Last: Chen Title: A Tailored Multivariate Mixture Model for Detecting Proteins of Concordant Change Among Virulent Strains of Clostridium perfringens Abstract: Necrotic enteritis (NE) is a serious disease of poultry caused by the bacterium C. perfringens. To identify proteins of C. perfringens that confer virulence with respect to NE, the protein secretions of four NE disease-producing strains and one baseline nondisease-producing strain of C. perfringens were examined. The problem then becomes a clustering task, for the identification of two extreme groups of proteins that were produced at either concordantly higher or concordantly lower levels across all four disease-producing strains compared to the baseline, when most of the proteins do not exhibit significant change across all strains. However, the existence of some nuisance proteins of discordant change may severely distort any biologically meaningful cluster pattern. We develop a tailored multivariate clustering approach to robustly identify the proteins of concordant change. Using a three-component normal mixture model as the skeleton, our approach incorporates several constraints to account for biological expectations and data characteristics. More importantly, we adopt a sparse mean-shift parameterization in the reference distribution, coupled with a regularized estimation approach, to flexibly accommodate proteins of discordant change. We explore the connections and differences between our approach and other robust clustering methods, and resolve the issue of unbounded likelihood under an eigenvalue-ratio condition. Simulation studies demonstrate the superior performance of our method compared with a number of alternative approaches. Our protein analysis along with further biological investigations may shed light on the discovery of the complete set of virulence factors in NE.
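[Illustrative note] The dependence of the TBA estimand on μc noted in the Swihart, Fay, and Miura abstract above is easy to exhibit under an explicit count model. The sketch below assumes, purely for illustration, negative binomial parasite counts with dispersion k; the paper's actual model and standardized estimator may differ in detail.

import numpy as np

def prob_infected(mu, k):
    """P(count > 0) under a negative binomial with mean mu and dispersion k."""
    return 1.0 - (k / (k + mu)) ** k

def standardized_tba(tra, mu_c, k):
    """Convert a TRA estimate into a TBA evaluated at a chosen control mean mu_c."""
    mu_test = (1.0 - tra) * mu_c          # test-arm mean implied by the TRA
    return 1.0 - prob_infected(mu_test, k) / prob_infected(mu_c, k)

# The same TRA of 90% implies very different TBAs as mu_c varies:
for mu_c in (5.0, 50.0, 500.0):
    print(mu_c, standardized_tba(tra=0.9, mu_c=mu_c, k=0.3))

Running the loop shows one fixed TRA translating into markedly different TBA values as the control mean grows, which is exactly why standardizing TBA at a target μc is needed before comparing assays.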
Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 546-559 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1356314 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356314 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:546-559 Template-Type: ReDIF-Article 1.0 Author-Name: Li Hsu Author-X-Name-First: Li Author-X-Name-Last: Hsu Author-Name: Malka Gorfine Author-X-Name-First: Malka Author-X-Name-Last: Gorfine Author-Name: David Zucker Author-X-Name-First: David Author-X-Name-Last: Zucker Title: On Estimation of the Hazard Function From Population-Based Case–Control Studies Abstract: The population-based case–control study design has been widely used for studying the etiology of chronic diseases. It is well established that the Cox proportional hazards model can be adapted to the case–control study and hazard ratios can be estimated by a (conditional) logistic regression model with time as either a matched set or a covariate. However, the baseline hazard function, a critical component in absolute risk assessment, is unidentifiable, because the ratio of cases to controls is controlled by the investigators and does not reflect the true disease incidence rate in the population. In this article, we propose a simple and innovative approach, which makes use of routinely collected family history information, to estimate the baseline hazard function for any logistic regression model that is fit to the risk factor data collected on cases and controls. We establish that the proposed baseline hazard function estimator is consistent and asymptotically normal and show via simulation that it performs well in finite samples. We illustrate the proposed method by a population-based case–control study of prostate cancer where the association of various risk factors is assessed and the family history information is used to estimate the baseline hazard function. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 560-570 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1356315 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356315 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:560-570 Template-Type: ReDIF-Article 1.0 Author-Name: Haiming Zhou Author-X-Name-First: Haiming Author-X-Name-Last: Zhou Author-Name: Timothy Hanson Author-X-Name-First: Timothy Author-X-Name-Last: Hanson Title: A Unified Framework for Fitting Bayesian Semiparametric Models to Arbitrarily Censored Survival Data, Including Spatially Referenced Data Abstract: A comprehensive, unified approach to modeling arbitrarily censored spatial survival data is presented for the three most commonly used semiparametric models: proportional hazards, proportional odds, and accelerated failure time. Unlike many other approaches, all manner of censored survival times are simultaneously accommodated including uncensored, interval censored, current-status, left and right censored, and mixtures of these. Left-truncated data are also accommodated leading to models for time-dependent covariates.
Both georeferenced (location exactly observed) and areally observed (location known up to a geographic unit such as a county) spatial locations are handled; formal variable selection makes model selection especially easy. Model fit is assessed with conditional Cox–Snell residual plots, and model choice is carried out via log pseudo marginal likelihood (LPML) and deviance information criterion (DIC). Baseline survival is modeled with a novel transformed Bernstein polynomial prior. All models are fit via a new function which calls efficient compiled C++ in the R package spBayesSurv. The methodology is broadly illustrated with simulations and real data applications. An important finding is that proportional odds and accelerated failure time models often fit significantly better than the commonly used proportional hazards model. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 571-581 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1356316 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356316 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:571-581 Template-Type: ReDIF-Article 1.0 Author-Name: Liang Li Author-X-Name-First: Liang Author-X-Name-Last: Li Author-Name: Chih-Hsien Wu Author-X-Name-First: Chih-Hsien Author-X-Name-Last: Wu Author-Name: Jing Ning Author-X-Name-First: Jing Author-X-Name-Last: Ning Author-Name: Xuelin Huang Author-X-Name-First: Xuelin Author-X-Name-Last: Huang Author-Name: Ya-Chen Tina Shih Author-X-Name-First: Ya-Chen Tina Author-X-Name-Last: Shih Author-Name: Yu Shen Author-X-Name-First: Yu Author-X-Name-Last: Shen Title: Semiparametric Estimation of Longitudinal Medical Cost Trajectory Abstract: Estimating the average monthly medical costs from disease diagnosis to a terminal event such as death for an incident cohort of patients is a topic of immense interest to researchers in health policy and health economics because patterns of average monthly costs over time reveal how medical costs vary across phases of care. The statistical challenges to estimating monthly medical costs longitudinally are multifold; the longitudinal cost trajectory (formed by plotting the average monthly costs from diagnosis to the terminal event) is likely to be nonlinear, with its shape depending on the time of the terminal event, which can be subject to right censoring. The goal of this article is to tackle this statistically challenging topic by estimating the conditional mean cost at any month t given the time of the terminal event s. The longitudinal cost trajectories with different terminal event times form a bivariate surface of t and s, under the constraint t ⩽ s. We propose to estimate this surface using bivariate penalized splines in an expectation-maximization algorithm that treats the censored terminal event times as missing data. We evaluate the proposed model and estimation method in simulations and apply the method to the medical cost data of an incident cohort of stage IV breast cancer patients from the Surveillance, Epidemiology, and End Results–Medicare Linked Database. Supplementary materials for this article are available online. 
Journal: Journal of the American Statistical Association Pages: 582-592 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1361329 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1361329 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:582-592 Template-Type: ReDIF-Article 1.0 Author-Name: Yuhang Xu Author-X-Name-First: Yuhang Author-X-Name-Last: Xu Author-Name: Yehua Li Author-X-Name-First: Yehua Author-X-Name-Last: Li Author-Name: Dan Nettleton Author-X-Name-First: Dan Author-X-Name-Last: Nettleton Title: Nested Hierarchical Functional Data Modeling and Inference for the Analysis of Functional Plant Phenotypes Abstract: In a plant science Root Image Study, the process of seedling roots bending in response to gravity is recorded using digital cameras, and the bending rates are modeled as functional plant phenotype data. The functional phenotypes are collected from seeds representing a large variety of genotypes and have a three-level nested hierarchical structure, with seeds nested in groups nested in genotypes. The seeds are imaged on different days of the lunar cycle, and an important scientific question is whether there are lunar effects on root bending. We allow the mean function of the bending rate to depend on the lunar day and model the phenotypic variation between genotypes, groups of seeds imaged together, and individual seeds by hierarchical functional random effects. We estimate the covariance functions of the functional random effects by a fast penalized tensor product spline approach, perform multi-level functional principal component analysis (FPCA) using the best linear unbiased predictor of the principal component scores, and improve the efficiency of mean estimation by iterative decorrelation. We choose the number of principal components using a conditional Akaike information criterion and test the lunar day effect using generalized likelihood ratio test statistics based on the marginal and conditional likelihoods. We also propose a permutation procedure to evaluate the null distribution of the test statistics. Our simulation studies show that our model selection criterion selects the correct number of principal components with remarkably high frequency, and the likelihood-based tests based on FPCA have higher power than a test based on working independence. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 593-606 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2017.1366907 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1366907 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:593-606 Template-Type: ReDIF-Article 1.0 Author-Name: Sonja A. Swanson Author-X-Name-First: Sonja A. Author-X-Name-Last: Swanson Author-Name: Miguel A. Hernán Author-X-Name-First: Miguel A. Author-X-Name-Last: Hernán Author-Name: Matthew Miller Author-X-Name-First: Matthew Author-X-Name-Last: Miller Author-Name: James M. Robins Author-X-Name-First: James M. Author-X-Name-Last: Robins Author-Name: Thomas S. Richardson Author-X-Name-First: Thomas S. 
Author-X-Name-Last: Richardson Title: Partial Identification of the Average Treatment Effect Using Instrumental Variables: Review of Methods for Binary Instruments, Treatments, and Outcomes Abstract: Several methods have been proposed for partially or point identifying the average treatment effect (ATE) using instrumental variable (IV) type assumptions. The descriptions of these methods are widespread across the statistical, economic, epidemiologic, and computer science literature, and the connections between the methods have not been readily apparent. In the setting of a binary instrument, treatment, and outcome, we review proposed methods for partial and point identification of the ATE under IV assumptions, express the identification results in a common notation and terminology, and propose a taxonomy that is based on sets of identifying assumptions. We further demonstrate and provide software for the application of these methods to estimate bounds. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 933-947 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2018.1434530 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1434530 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:933-947 Template-Type: ReDIF-Article 1.0 Author-Name: Houshmand Shirani-Mehr Author-X-Name-First: Houshmand Author-X-Name-Last: Shirani-Mehr Author-Name: David Rothschild Author-X-Name-First: David Author-X-Name-Last: Rothschild Author-Name: Sharad Goel Author-X-Name-First: Sharad Author-X-Name-Last: Goel Author-Name: Andrew Gelman Author-X-Name-First: Andrew Author-X-Name-Last: Gelman Title: Disentangling Bias and Variance in Election Polls Abstract: It is well known among researchers and practitioners that election polls suffer from a variety of sampling and nonsampling errors, often collectively referred to as total survey error. Reported margins of error typically only capture sampling variability, and in particular, generally ignore nonsampling errors in defining the target population (e.g., errors due to uncertainty in who will vote). Here, we empirically analyze 4221 polls for 608 state-level presidential, senatorial, and gubernatorial elections between 1998 and 2014, all of which were conducted during the final three weeks of the campaigns. Compared to the actual election outcomes, we find that average survey error as measured by root mean square error is approximately 3.5 percentage points, about twice as large as that implied by most reported margins of error. We decompose survey error into election-level bias and variance terms. We find that average absolute election-level bias is about 2 percentage points, indicating that polls for a given election often share a common component of error. This shared error may stem from the fact that polling organizations often face similar difficulties in reaching various subgroups of the population, and that they rely on similar screening rules when estimating who will vote. We also find that average election-level variance is higher than implied by simple random sampling, in part because polling organizations often use complex sampling designs and adjustment procedures. We conclude by discussing how these results help explain polling failures in the 2016 U.S. presidential election, and offer recommendations to improve polling practice.
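[Illustrative note] The bias/variance decomposition described in the Shirani-Mehr, Rothschild, Goel, and Gelman abstract above amounts to simple grouped summaries once poll-level errors are formed. A bare-bones sketch with hypothetical column names; the article's full analysis is model-based, so this is only the accounting identity behind it.

import numpy as np
import pandas as pd

def decompose_poll_error(df):
    """df columns (hypothetical): election, poll_est, outcome (both in percentage points)."""
    df = df.assign(error=df.poll_est - df.outcome)
    rmse = np.sqrt(np.mean(df.error ** 2))                # total survey error
    bias = df.groupby("election").error.mean()            # election-level bias b_e
    mean_abs_bias = bias.abs().mean()                     # shared-error component
    within_var = df.groupby("election").error.var().mean()  # poll-to-poll variance
    return rmse, mean_abs_bias, within_var

Comparing within_var to the binomial variance implied by each poll's sample size is one quick way to see the excess variance the abstract attributes to complex designs and adjustment procedures.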
Journal: Journal of the American Statistical Association Pages: 607-614 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2018.1448823 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1448823 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:607-614 Template-Type: ReDIF-Article 1.0 Author-Name: Barry D. Nussbaum Author-X-Name-First: Barry D. Author-X-Name-Last: Nussbaum Title: Statistics: Essential Now More Than Ever Abstract: Each year, the Journal of the American Statistical Association publishes the presidential address from the Joint Statistical Meetings. Here, we present the 2017 address verbatim save for the addition of references and a few minor editorial corrections. Journal: Journal of the American Statistical Association Pages: 489-493 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2018.1463486 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1463486 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:489-493 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Book Reviews Journal: Journal of the American Statistical Association Pages: 948-953 Issue: 522 Volume: 113 Year: 2018 Month: 4 X-DOI: 10.1080/01621459.2018.1486071 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1486071 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:948-953 Template-Type: ReDIF-Article 1.0 Author-Name: Karun Adusumilli Author-X-Name-First: Karun Author-X-Name-Last: Adusumilli Author-Name: Taisuke Otsu Author-X-Name-First: Taisuke Author-X-Name-Last: Otsu Title: Empirical Likelihood for Random Sets Abstract: In many statistical applications, the observed data take the form of sets rather than points. Examples include bracket data in survey analysis, tumor growth and rock grain images in morphology analysis, and noisy measurements on the support function of a convex set in medical imaging and robotic vision. Additionally, in studies of treatment effects, researchers often wish to conduct inference on nonparametric bounds for the effects which can be expressed by means of random sets. This article develops the concept of nonparametric likelihood for random sets and its mean, known as the Aumann expectation, and proposes general inference methods by adapting the theory of empirical likelihood. Several examples, such as regression with bracket income data, Boolean models for tumor growth, bound analysis on treatment effects, and image analysis via support functions, illustrate the usefulness of the proposed methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1064-1075 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1188107 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1188107 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
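[Illustrative note] The random-set empirical likelihood in the Adusumilli and Otsu abstract above extends the classical point-data construction, which is itself short enough to sketch. Below is the standard scalar-mean empirical likelihood log-ratio computed via its Lagrange dual; this is background for the point-data special case only, not the Aumann-expectation version developed in the article.

import numpy as np
from scipy.optimize import brentq

def el_log_ratio(x, mu):
    """Log empirical likelihood ratio for the mean mu; -2x is asymptotically chi2(1)."""
    z = x - mu
    if z.min() >= 0 or z.max() <= 0:
        return -np.inf                         # mu outside the convex hull of the data
    # Feasible multipliers keep all weights positive: 1 + lam * z_i > 0 for all i
    lo = (-1.0 + 1e-10) / z.max()
    hi = (-1.0 + 1e-10) / z.min()
    # The dual estimating equation sum(z_i / (1 + lam * z_i)) = 0 is monotone in lam
    lam = brentq(lambda l: np.sum(z / (1.0 + l * z)), lo, hi)
    return -np.sum(np.log(1.0 + lam * z))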
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1064-1075 Template-Type: ReDIF-Article 1.0 Author-Name: Kin Wai Chan Author-X-Name-First: Kin Wai Author-X-Name-Last: Chan Author-Name: Chun Yip Yau Author-X-Name-First: Chun Yip Author-X-Name-Last: Yau Title: Automatic Optimal Batch Size Selection for Recursive Estimators of Time-Average Covariance Matrix Abstract: The time-average covariance matrix (TACM) $\bm{\Sigma} := \sum_{k \in \mathbb{Z}} \bm{\Gamma}_k$, where $\bm{\Gamma}_k$ is the auto-covariance function, is an important quantity for the inference of the mean of an $\mathbb{R}^d$-valued stationary process (d ⩾ 1). This article proposes two recursive estimators for $\bm{\Sigma}$ with optimal asymptotic mean square error (AMSE) under different strengths of serial dependence. The optimal estimator involves a batch size selection, which requires knowledge of a smoothness parameter $\bm{\Upsilon}_{\beta} := \sum_{k \in \mathbb{Z}} |k|^{\beta} \bm{\Gamma}_k$, for some β. This article also develops recursive estimators for $\bm{\Upsilon}_{\beta}$. Combining these two estimators, we obtain a fully automatic procedure for optimal online estimation of $\bm{\Sigma}$. Consistency and convergence rates of the proposed estimators are derived. Applications to confidence region construction and Markov chain Monte Carlo convergence diagnosis are discussed. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1076-1089 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1189337 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1189337 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1076-1089 Template-Type: ReDIF-Article 1.0 Author-Name: Neil Shephard Author-X-Name-First: Neil Author-X-Name-Last: Shephard Author-Name: Justin J. Yang Author-X-Name-First: Justin J. Author-X-Name-Last: Yang Title: Continuous Time Analysis of Fleeting Discrete Price Moves Abstract: This article proposes a novel model of financial prices where (i) prices are discrete; (ii) prices change in continuous time; (iii) a high proportion of price changes are reversed in a fraction of a second. Our model is analytically tractable and directly formulated in terms of the calendar time and price impact curve. The resulting càdlàg price process is a piecewise constant semimartingale with finite activity, finite variation, and no Brownian motion component. We use moment-based estimations to fit four high-frequency futures datasets and demonstrate the descriptive power of our proposed model. This model is able to describe the observed dynamics of price changes over three different orders of magnitude of time intervals. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1090-1106 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1192544 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1192544 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1090-1106 Template-Type: ReDIF-Article 1.0 Author-Name: Yun Yang Author-X-Name-First: Yun Author-X-Name-Last: Yang Author-Name: Surya T. Tokdar Author-X-Name-First: Surya T.
Author-X-Name-Last: Tokdar Title: Joint Estimation of Quantile Planes Over Arbitrary Predictor Spaces Abstract: In spite of the recent surge of interest in quantile regression, joint estimation of linear quantile planes remains a great challenge in statistics and econometrics. We propose a novel parameterization that characterizes any collection of noncrossing quantile planes over arbitrarily shaped convex predictor domains in any dimension by means of unconstrained scalar-, vector-, and function-valued parameters. Statistical models based on this parameterization inherit a fast computation of the likelihood function, enabling penalized likelihood or Bayesian approaches to model fitting. We introduce a complete Bayesian methodology by using Gaussian process prior distributions on the function-valued parameters and develop a robust and efficient Markov chain Monte Carlo parameter estimation scheme. The resulting method is shown to offer posterior consistency under mild tail and regularity conditions. We present several illustrative examples where the new method is compared against existing approaches and is found to offer better accuracy, coverage, and model fit. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1107-1120 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1192545 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1192545 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1107-1120 Template-Type: ReDIF-Article 1.0 Author-Name: Thomas S. Richardson Author-X-Name-First: Thomas S. Author-X-Name-Last: Richardson Author-Name: James M. Robins Author-X-Name-First: James M. Author-X-Name-Last: Robins Author-Name: Linbo Wang Author-X-Name-First: Linbo Author-X-Name-Last: Wang Title: On Modeling and Estimation for the Relative Risk and Risk Difference Abstract: A common problem in formulating models for the relative risk and risk difference is the variation dependence between these parameters and the baseline risk, which is a nuisance model. We address this problem by proposing the conditional log odds-product as a preferred nuisance model. This novel nuisance model not only facilitates maximum-likelihood estimation, but also permits doubly robust estimation of the parameters of interest. Our approach is illustrated via simulations and a data analysis. An R package brm implementing the proposed methods is available on CRAN. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1121-1130 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1192546 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1192546 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1121-1130 Template-Type: ReDIF-Article 1.0 Author-Name: Lexin Li Author-X-Name-First: Lexin Author-X-Name-Last: Li Author-Name: Xin Zhang Author-X-Name-First: Xin Author-X-Name-Last: Zhang Title: Parsimonious Tensor Response Regression Abstract: Aiming at abundant scientific and engineering data with not only high dimensionality but also complex structure, we study the regression problem with a multidimensional array (tensor) response and a vector predictor.
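[Illustrative note] The variation independence claimed in the Richardson, Robins, and Wang abstract above means that any pair (log relative risk θ, log odds-product φ) maps to a valid pair of risks (p0, p1). The paper and its R package brm use a closed form; the sketch below recovers the risks by root-finding instead, purely for transparency, with names of our own choosing.

import numpy as np
from scipy.optimize import brentq

def risks_from_theta_phi(theta, phi):
    """Recover (p0, p1) from theta = log(p1/p0), phi = log(p0*p1/((1-p0)(1-p1)))."""
    rr = np.exp(theta)
    upper = min(1.0, 1.0 / rr)            # keep p1 = rr * p0 strictly below 1
    def f(p0):
        p1 = rr * p0
        return (np.log(p0) + np.log(p1)
                - np.log1p(-p0) - np.log1p(-p1) - phi)
    eps = 1e-12
    p0 = brentq(f, eps, upper - eps)      # f is monotone, so the root is unique
    return p0, rr * p0

p0, p1 = risks_from_theta_phi(theta=np.log(2.0), phi=-1.0)

Because f runs from minus to plus infinity over the feasible interval for every (θ, φ), the map is well defined on all of R², which is what makes the log odds-product a convenient nuisance parameterization.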
Applications include, among others, comparing tensor images across groups after adjusting for additional covariates, which is of central interest in neuroimaging analysis. We propose parsimonious tensor response regression adopting a generalized sparsity principle. It models all voxels of the tensor response jointly, while accounting for the inherent structural information among the voxels. It effectively reduces the number of free parameters, leading to feasible computation and improved interpretation. We achieve model estimation through a nascent technique called the envelope method, which identifies the immaterial information and focuses the estimation on the material information in the tensor response. We demonstrate that the resulting estimator is asymptotically efficient, and it enjoys a competitive finite sample performance. We also illustrate the new method on two real neuroimaging studies. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1131-1146 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1193022 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1193022 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1131-1146 Template-Type: ReDIF-Article 1.0 Author-Name: David Choi Author-X-Name-First: David Author-X-Name-Last: Choi Title: Estimation of Monotone Treatment Effects in Network Experiments Abstract: Randomized experiments on social networks pose statistical challenges, due to the possibility of interference between units. We propose new methods for finding confidence intervals on the attributable treatment effect in such settings. The methods do not require partial interference, but instead require an identifying assumption that is similar to requiring nonnegative treatment effects. Network or spatial information can be used to customize the test statistic; in principle, this can increase power without making assumptions on the data-generating process. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1147-1155 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1194845 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1194845 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1147-1155 Template-Type: ReDIF-Article 1.0 Author-Name: Xiao Wang Author-X-Name-First: Xiao Author-X-Name-Last: Wang Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Title: Generalized Scalar-on-Image Regression Models via Total Variation Abstract: The use of imaging markers to predict clinical outcomes can have a great impact in public health. The aim of this article is to develop a class of generalized scalar-on-image regression models via total variation (GSIRM-TV), in the sense of generalized linear models, for scalar response and imaging predictor with the presence of scalar covariates. A key novelty of GSIRM-TV is that it is assumed that the slope function (or image) of GSIRM-TV belongs to the space of bounded total variation to explicitly account for the piecewise smooth nature of most imaging data. We develop an efficient penalized total variation optimization to estimate the unknown slope function and other parameters.
We also establish nonasymptotic error bounds on the excess risk. These bounds are explicitly specified in terms of sample size, image size, and image smoothness. Our simulations demonstrate a superior performance of GSIRM-TV against many existing approaches. We apply GSIRM-TV to the analysis of hippocampus data obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1156-1168 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1194846 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1194846 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1156-1168 Template-Type: ReDIF-Article 1.0 Author-Name: Jialiang Li Author-X-Name-First: Jialiang Author-X-Name-Last: Li Author-Name: Chao Huang Author-X-Name-First: Chao Author-X-Name-Last: Huang Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Title: A Functional Varying-Coefficient Single-Index Model for Functional Response Data Abstract: Motivated by the analysis of imaging data, we propose a novel functional varying-coefficient single-index model (FVCSIM) to carry out the regression analysis of functional response data on a set of covariates of interest. FVCSIM represents a new extension of varying-coefficient single-index models for scalar responses collected from cross-sectional and longitudinal studies. An efficient estimation procedure is developed to iteratively estimate varying coefficient functions, link functions, index parameter vectors, and the covariance function of individual functions. We systematically examine the asymptotic properties of all estimators including the weak convergence of the estimated varying coefficient functions, the asymptotic distribution of the estimated index parameter vectors, and the uniform convergence rate of the estimated covariance function and their spectrum. Simulation studies are carried out to assess the finite-sample performance of the proposed procedure. We apply FVCSIM to investigate the development of white matter diffusivities along the corpus callosum skeleton obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study. Supplementary material for this article is available online. Journal: Journal of the American Statistical Association Pages: 1169-1181 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1195742 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1195742 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1169-1181 Template-Type: ReDIF-Article 1.0 Author-Name: Tomohiro Ando Author-X-Name-First: Tomohiro Author-X-Name-Last: Ando Author-Name: Jushan Bai Author-X-Name-First: Jushan Author-X-Name-Last: Bai Title: Clustering Huge Number of Financial Time Series: A Panel Data Approach With High-Dimensional Predictors and Factor Structures Abstract: This article introduces a new procedure for clustering a large number of financial time series based on high-dimensional panel data with grouped factor structures. The proposed method attempts to capture the level of similarity of each of the time series based on sensitivity to observable factors as well as to the unobservable factor structure.
The proposed method allows for correlations between observable and unobservable factors and also allows for cross-sectional and serial dependence and heteroscedasticities in the error structure, which are common in financial markets. In addition, theoretical properties are established for the procedure. We apply the method to analyze the returns for over 6000 international stocks from over 100 financial markets. The empirical analysis quantifies the extent to which the U.S. subprime crisis spilled over to the global financial markets. Furthermore, we find that nominal classifications based on either listed market, industry, country or region are insufficient to characterize the heterogeneity of the global financial markets. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1182-1198 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1195743 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1195743 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1182-1198 Template-Type: ReDIF-Article 1.0 Author-Name: Simon N. Wood Author-X-Name-First: Simon N. Author-X-Name-Last: Wood Author-Name: Zheyuan Li Author-X-Name-First: Zheyuan Author-X-Name-Last: Li Author-Name: Gavin Shaddick Author-X-Name-First: Gavin Author-X-Name-Last: Shaddick Author-Name: Nicole H. Augustin Author-X-Name-First: Nicole H. Author-X-Name-Last: Augustin Title: Generalized Additive Models for Gigadata: Modeling the U.K. Black Smoke Network Daily Data Abstract: We develop scalable methods for fitting penalized regression spline based generalized additive models with on the order of 10^4 coefficients to up to 10^8 data points. Computational feasibility rests on: (i) a new iteration scheme for estimation of model coefficients and smoothing parameters, avoiding poorly scaling matrix operations; (ii) parallelization of the iteration’s pivoted block Cholesky and basic matrix operations; (iii) the marginal discretization of model covariates to reduce memory footprint, with efficient scalable methods for computing required crossproducts directly from the discrete representation. Marginal discretization enables much finer discretization than joint discretization would permit. We were motivated by the need to model four decades worth of daily particulate data from the U.K. Black Smoke and Sulphur Dioxide Monitoring Network. Although reduced in size recently, over 2000 stations have at some time been part of the network, resulting in some 10 million measurements. Modeling at a daily scale is desirable for accurate trend estimation and mapping, and to provide daily exposure estimates for epidemiological cohort studies. Because of the dataset size, previous work has focused on modeling time or space averaged pollution levels, but this is unsatisfactory from a health perspective, since it is often acute exposure locally and on the time scale of days that is of most importance in driving adverse health outcomes. If computed by conventional means our black smoke model would require a half terabyte of storage just for the model matrix, whereas we are able to compute with it on a desktop workstation. The best previously available reduced memory footprint method would have required three orders of magnitude more computing time than our new method. Supplementary materials for this article are available online.
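[Illustrative note] The discretization trick behind the Wood, Li, Shaddick, and Augustin abstract above is that, once a covariate is rounded to m unique values, the expensive model-matrix crossproducts reduce to weighted sums over bins. A self-contained numerical check of that identity; the variable names are ours, not mgcv's, and this shows only the single-covariate case.

import numpy as np

n, m, p = 10_000, 100, 10
rng = np.random.default_rng(1)
B = rng.standard_normal((m, p))        # basis evaluated at m discrete covariate values
idx = rng.integers(0, m, size=n)       # bin index of each of the n observations
w = rng.random(n)                      # observation weights

X = B[idx]                             # full n x p model matrix (never needed in practice)
direct = X.T @ (w[:, None] * X)        # O(n p^2) crossproduct

w_bar = np.bincount(idx, weights=w, minlength=m)   # O(n) tally of weights per bin
fast = B.T @ (w_bar[:, None] * B)                  # O(m p^2) crossproduct over bins

assert np.allclose(direct, fast)

Since m is tiny relative to n, the crossproduct cost no longer scales with the data size once the per-bin weight tallies are formed, which is the core of the memory and time savings reported in the abstract.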
Journal: Journal of the American Statistical Association Pages: 1199-1210 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1195744 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1195744 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1199-1210 Template-Type: ReDIF-Article 1.0 Author-Name: Cyrus J. DiCiccio Author-X-Name-First: Cyrus J. Author-X-Name-Last: DiCiccio Author-Name: Joseph P. Romano Author-X-Name-First: Joseph P. Author-X-Name-Last: Romano Title: Robust Permutation Tests for Correlation and Regression Coefficients Abstract: Given a sample from a bivariate distribution, consider the problem of testing independence. A permutation test based on the sample correlation is known to be an exact level α test. However, when used to test the null hypothesis that the samples are uncorrelated, the permutation test can have rejection probability that is far from the nominal level. Further, the permutation test can have a large Type 3 (directional) error rate, whereby there can be a large probability that the permutation test rejects because the sample correlation is a large positive value, when in fact the true correlation is negative. It will be shown that studentizing the sample correlation leads to a permutation test which is exact under independence and asymptotically controls the probability of Type 1 (or Type 3) errors. These conclusions are based on our results describing the almost sure limiting behavior of the randomization distribution. We will also present asymptotically robust randomization tests for regression coefficients, including a result based on a modified procedure of Freedman and Lane. Simulations and empirical applications are included. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1211-1220 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1202117 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1202117 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1211-1220 Template-Type: ReDIF-Article 1.0 Author-Name: Jon Arni Steingrimsson Author-X-Name-First: Jon Arni Author-X-Name-Last: Steingrimsson Author-Name: Robert L. Strawderman Author-X-Name-First: Robert L. Author-X-Name-Last: Strawderman Title: Estimation in the Semiparametric Accelerated Failure Time Model With Missing Covariates: Improving Efficiency Through Augmentation Abstract: This article considers linear regression with missing covariates and a right censored outcome. We first consider a general two-phase outcome sampling design, where full covariate information is only ascertained for subjects in phase two and sampling occurs under an independent Bernoulli sampling scheme with known subject-specific sampling probabilities that depend on phase one information (e.g., survival time, failure status and covariates). The semiparametric information bound is derived for estimating the regression parameter in this setting. We also introduce a more practical class of augmented estimators that is shown to improve asymptotic efficiency over simple but inefficient inverse probability of sampling weighted estimators. Estimation for known sampling weights and extensions to the case of estimated sampling weights are both considered.
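[Illustrative note] The studentization fix in the DiCiccio and Romano abstract above is mechanical: replace the raw sample correlation by a heteroscedasticity-robust studentized version inside an otherwise ordinary permutation test. A minimal sketch, where the moment-based variance estimate is our reading of the standard formula and the function names are ours.

import numpy as np

def studentized_corr(x, y):
    """Sample correlation studentized by a moment estimate of its asymptotic variance."""
    xc, yc = x - x.mean(), y - y.mean()
    r = np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))
    tau2 = np.mean(xc**2 * yc**2) / (np.mean(xc**2) * np.mean(yc**2))
    return np.sqrt(len(x)) * r / np.sqrt(tau2)

def perm_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value based on the studentized correlation."""
    rng = np.random.default_rng(seed)
    t_obs = abs(studentized_corr(x, y))
    t_perm = np.array([abs(studentized_corr(x, rng.permutation(y)))
                       for _ in range(n_perm)])
    return (1 + np.sum(t_perm >= t_obs)) / (1 + n_perm)

Because permuting y enforces exchangeability, the test stays exact under full independence, while the studentization is what restores asymptotic validity under the weaker null of zero correlation.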
The allowance for estimated sampling weights permits covariates to be missing at random according to a monotone but unknown mechanism. The asymptotic properties of the augmented estimators are derived and simulation results demonstrate substantial efficiency improvements over simpler inverse probability of sampling weighted estimators in the indicated settings. With suitable modification, the proposed methodology can also be used to improve augmented estimators previously used for missing covariates in a Cox regression model. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1221-1235 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1205500 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1205500 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1221-1235 Template-Type: ReDIF-Article 1.0 Author-Name: Ganggang Xu Author-X-Name-First: Ganggang Author-X-Name-Last: Xu Author-Name: Marc G. Genton Author-X-Name-First: Marc G. Author-X-Name-Last: Genton Title: Tukey g-and-h Random Fields Abstract: We propose a new class of trans-Gaussian random fields named Tukey g-and-h (TGH) random fields to model non-Gaussian spatial data. The proposed TGH random fields have extremely flexible marginal distributions, possibly skewed and/or heavy-tailed, and, therefore, have a wide range of applications. The special formulation of the TGH random field enables an automatic search for the most suitable transformation for the dataset of interest while estimating model parameters. Asymptotic properties of the maximum likelihood estimator and the probabilistic properties of the TGH random fields are investigated. An efficient estimation procedure, based on maximum approximated likelihood, is proposed and an extreme spatial outlier detection algorithm is formulated. Kriging and probabilistic prediction with TGH random fields are developed along with prediction confidence intervals. The predictive performance of TGH random fields is demonstrated through extensive simulation studies and an application to a dataset of total precipitation in the southeast of the United States. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1236-1249 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1205501 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1205501 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1236-1249 Template-Type: ReDIF-Article 1.0 Author-Name: Pengfei Li Author-X-Name-First: Pengfei Author-X-Name-Last: Li Author-Name: Yukun Liu Author-X-Name-First: Yukun Author-X-Name-Last: Liu Author-Name: Jing Qin Author-X-Name-First: Jing Author-X-Name-Last: Qin Title: Semiparametric Inference in a Genetic Mixture Model Abstract: In genetic backcross studies, data are often collected from complex mixtures of distributions with known mixing proportions. Previous approaches to the inference of these genetic mixture models involve parameterizing the component distributions. However, model misspecification of any form is expected to have detrimental effects.
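[Illustrative note] The Tukey g-and-h transformation underlying the random fields in the Xu and Genton abstract above has a simple closed form, $\tau_{g,h}(z) = g^{-1}(e^{gz} - 1)\exp(hz^2/2)$, with g controlling skewness and h ⩾ 0 controlling tail weight. A sketch of the marginal transformation only; the spatial dependence structure is the substance of the paper and is not reproduced here.

import numpy as np

def tukey_gh(z, g=0.5, h=0.1):
    """Tukey g-and-h transform of standard normal input z; g -> 0 limit handled."""
    base = z if abs(g) < 1e-12 else (np.exp(g * z) - 1.0) / g
    return base * np.exp(h * z**2 / 2.0)

z = np.random.default_rng(2).standard_normal(100_000)
x = tukey_gh(z)        # skewed, heavy-tailed marginal from Gaussian input

Applying this transform to each margin of a Gaussian random field is what gives the TGH class its flexible, possibly skewed and heavy-tailed marginals while keeping a Gaussian process underneath.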
We propose a semiparametric likelihood method for genetic mixture models: the empirical likelihood under the exponential tilting model assumption, in which the log ratio of the probability (density) functions from the components is linear in the observations. An application to mice cancer genetics involves random numbers of offspring within a litter. In other words, the cluster size is a random variable. We wish to test the null hypothesis that there is no difference between the two components in the mixture model, but unfortunately we find that the Fisher information is degenerate. As a consequence, the conventional two-term expansion in the likelihood ratio statistic does not work. By using a higher-order expansion, we are able to establish a nonstandard convergence rate N^{-1/4} for the odds ratio parameter estimator $\hat{\beta}$. Moreover, the limiting distribution of the empirical likelihood ratio statistic is derived. The underlying distribution function of each component can also be estimated semiparametrically. Analogously to the full parametric approach, we develop an expectation-maximization algorithm for finding the semiparametric maximum likelihood estimator. Simulation results and a real cancer application indicate that the proposed semiparametric method works much better than parametric methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1250-1260 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1208614 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1208614 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1250-1260 Template-Type: ReDIF-Article 1.0 Author-Name: Lizhen Lin Author-X-Name-First: Lizhen Author-X-Name-Last: Lin Author-Name: Brian St. Thomas Author-X-Name-First: Brian Author-X-Name-Last: St. Thomas Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: Extrinsic Local Regression on Manifold-Valued Data Abstract: We propose an extrinsic regression framework for modeling data with manifold-valued responses and Euclidean predictors. Regression with manifold responses has wide applications in shape analysis, neuroscience, medical imaging, and many other areas. Our approach embeds the manifold where the responses lie into a higher-dimensional Euclidean space, obtains a local regression estimate in that space, and then projects this estimate back onto the image of the manifold. Outside the regression setting, both intrinsic and extrinsic approaches have been proposed for modeling iid manifold-valued data. However, to our knowledge, our work is the first to take an extrinsic approach to the regression problem. The proposed extrinsic regression framework is general, computationally efficient, and theoretically appealing. Asymptotic distributions and convergence rates of the extrinsic regression estimates are derived, and a large class of examples is considered, indicating the wide applicability of our approach. Supplementary materials for this article are available online.
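The embed, smooth, and project recipe in the extrinsic regression abstract above can be made concrete with a toy case: responses on the unit sphere, whose standard embedding into R^3 reduces the projection step to renormalization. The Gaussian kernel and bandwidth are illustrative assumptions; this is a minimal sketch of the general idea, not the authors' implementation:

```python
import numpy as np

def extrinsic_local_regression(X, Y, x0, bandwidth=0.5):
    """Nadaraya-Watson smoothing in the ambient Euclidean space,
    followed by projection back onto the unit sphere (its standard
    extrinsic embedding). X: (n, p) Euclidean predictors;
    Y: (n, 3) responses lying on the unit sphere in R^3."""
    d2 = np.sum((X - x0)**2, axis=1)
    w = np.exp(-d2 / (2 * bandwidth**2))        # Gaussian kernel weights
    m = (w[:, None] * Y).sum(axis=0) / w.sum()  # local average in R^3
    return m / np.linalg.norm(m)                # project back to the sphere
```

For other manifolds, the same two steps apply with the appropriate embedding and projection map; the sphere is the case where the projection has the simplest closed form.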
Journal: Journal of the American Statistical Association Pages: 1261-1273 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1208615 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1208615 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1261-1273 Template-Type: ReDIF-Article 1.0 Author-Name: Matthew Plumlee Author-X-Name-First: Matthew Author-X-Name-Last: Plumlee Title: Bayesian Calibration of Inexact Computer Models Abstract: Bayesian calibration is used to study computer models in the presence of both a calibration parameter and model bias. The calibration parameter in the predominant methodology is left undefined, which results in an issue where the posterior of the parameter is suboptimally broad; no generally accepted alternative has emerged to date. This article proposes using Bayesian calibration where the prior distribution on the bias is orthogonal to the gradient of the computer model. Problems associated with Bayesian calibration are shown to be mitigated through analytic results in addition to examples. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1274-1285 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1211016 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1211016 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1274-1285 Template-Type: ReDIF-Article 1.0 Author-Name: Kyle Vincent Author-X-Name-First: Kyle Author-X-Name-Last: Vincent Author-Name: Steve Thompson Author-X-Name-First: Steve Author-X-Name-Last: Thompson Title: Estimating Population Size With Link-Tracing Sampling Abstract: We present a new design and method for estimating the size of a hidden population best reached through a link-tracing design. The design is based on selecting initial samples at random and then adaptively tracing links to add new members. The inferential procedure involves the Rao–Blackwell theorem applied to a sufficient statistic markedly different from the usual one that arises in sampling from a finite population. The strategy involves a combination of link-tracing and mark-recapture estimation methods. An empirical application is described. The results demonstrate that the strategy can efficiently incorporate adaptively selected members of the sample into the inferential procedure. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1286-1295 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1212712 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1212712 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1286-1295 Template-Type: ReDIF-Article 1.0 Author-Name: Ming-Yueh Huang Author-X-Name-First: Ming-Yueh Author-X-Name-Last: Huang Author-Name: Chin-Tsang Chiang Author-X-Name-First: Chin-Tsang Author-X-Name-Last: Chiang Title: An Effective Semiparametric Estimation Approach for the Sufficient Dimension Reduction Model Abstract: In exploratory data analysis, the sufficient dimension reduction model has been widely used to characterize the conditional distribution of interest.
Different from the existing approaches, our main achievement is to simultaneously estimate two essential elements, the basis and the structural dimension, of the central subspace and the bandwidth of a kernel distribution estimator through a single estimation criterion. With an appropriate order of the kernel function, the proposed estimation procedure can be effectively carried out by starting with a dimension of zero until the first local minimum is reached. Meanwhile, the optimal bandwidth selector is ensured to be a valid tuning parameter for the central subspace estimator. An important advantage of this estimation technique is its flexibility in allowing the response to be discrete and some of the covariates to be discrete or categorical, provided that a certain continuity condition holds. Under very mild assumptions, we further derive the uniform consistency of the introduced optimization function and the consistency of the resulting estimators. Moreover, the asymptotic normality of the central subspace estimator is established with an estimated rather than exact structural dimension. In extensive simulations, the developed approach generally outperforms the competitors. Data from previous studies are also used to illustrate the proposal. On the whole, our methodology is very effective in estimating the central subspace and conditional distribution, highly flexible in adapting to diverse types of responses and covariates, and practically feasible in obtaining an asymptotically optimal and valid bandwidth estimator. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1296-1310 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1215987 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1215987 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1296-1310 Template-Type: ReDIF-Article 1.0 Author-Name: Kazuki Uematsu Author-X-Name-First: Kazuki Author-X-Name-Last: Uematsu Author-Name: Yoonkyung Lee Author-X-Name-First: Yoonkyung Author-X-Name-Last: Lee Title: On Theoretically Optimal Ranking Functions in Bipartite Ranking Abstract: This article investigates the theoretical relation between loss criteria and the optimal ranking functions driven by the criteria in bipartite ranking. In particular, the relation between area under the ROC curve (AUC) maximization and minimization of ranking risk under a convex loss is examined. We characterize general conditions for ranking-calibrated loss functions in a pairwise approach, and show that the best ranking functions under convex ranking-calibrated loss criteria produce the same ordering as the likelihood ratio of the positive category to the negative category over the instance space. The result illuminates the parallel between ranking and classification in general, and suggests the notion of consistency in ranking when a convex ranking risk is minimized, as in the RankBoost algorithm, for instance. For a certain class of loss functions, including the exponential loss and the binomial deviance, we specify the optimal ranking function explicitly in relation to the underlying probability distribution (see the display below). In addition, we present an in-depth analysis of hinge loss optimization for ranking and point out that the RankSVM may produce many ties or granularity in ranking scores due to the singularity of the hinge loss, which could result in ranking inconsistency.
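For the exponential loss named in the abstract above, the explicit optimal ranking function takes a familiar half-log-odds form; the display states this standard result in our own notation, as an illustration:

$$
f^{*} \in \arg\min_{f}\ \mathbb{E}\bigl[\exp\{-(f(X^{+}) - f(X^{-}))\}\bigr]
\quad\Longrightarrow\quad
f^{*}(x) = \tfrac{1}{2}\log\frac{g_{+}(x)}{g_{-}(x)} + c,
$$

where $X^{+} \sim g_{+}$ and $X^{-} \sim g_{-}$ are independent draws from the positive and negative class-conditional distributions and $c$ is an arbitrary constant. Since the logarithm is increasing, $f^{*}$ induces the same ordering over the instance space as the likelihood ratio $g_{+}/g_{-}$, which is the AUC-optimal ordering.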
The theoretical findings are illustrated with numerical examples. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1311-1322 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1215988 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1215988 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1311-1322 Template-Type: ReDIF-Article 1.0 Author-Name: Francis K. C. Hui Author-X-Name-First: Francis K. C. Author-X-Name-Last: Hui Author-Name: Samuel Müller Author-X-Name-First: Samuel Author-X-Name-Last: Müller Author-Name: A. H. Welsh Author-X-Name-First: A. H. Author-X-Name-Last: Welsh Title: Joint Selection in Mixed Models using Regularized PQL Abstract: The application of generalized linear mixed models presents some major challenges for both estimation, due to the intractable marginal likelihood, and model selection, as we usually want to jointly select over both fixed and random effects. We propose to overcome these challenges by combining penalized quasi-likelihood (PQL) estimation with sparsity-inducing penalties on the fixed and random coefficients. The resulting approach, referred to as regularized PQL, is a computationally efficient method for performing joint selection in mixed models. A key aspect of regularized PQL involves the use of a group-based penalty for the random effects: sparsity is induced such that all the coefficients for a random effect are shrunk to zero simultaneously, which in turn leads to the random effect being removed from the model. Despite being a quasi-likelihood approach, we show that regularized PQL is selection consistent, that is, it asymptotically selects the true set of fixed and random effects, in the setting where the cluster size grows with the number of clusters. Furthermore, we propose an information criterion for choosing the single tuning parameter and show that it facilitates selection consistency. Simulations demonstrate that regularized PQL outperforms several currently employed methods for joint selection even if the cluster size is small compared to the number of clusters, while also offering dramatic reductions in computation time. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1323-1333 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1215989 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1215989 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1323-1333 Template-Type: ReDIF-Article 1.0 Author-Name: Ulrich K. Müller Author-X-Name-First: Ulrich K. Author-X-Name-Last: Müller Author-Name: Yulong Wang Author-X-Name-First: Yulong Author-X-Name-Last: Wang Title: Fixed-k Asymptotic Inference About Tail Properties Abstract: We consider inference about tail properties of a distribution from an iid sample, based on extreme value theory. All of the numerous previous suggestions rely on asymptotics where, eventually, an infinite number of observations from the tail behave as predicted by extreme value theory, enabling the consistent estimation of the key tail index, and the construction of confidence intervals using the delta method or other classic approaches.
In small samples, however, extreme value theory might well provide good approximations for only a relatively small number of tail observations. To accommodate this concern, we develop asymptotically valid confidence intervals for high quantile and tail conditional expectations that only require extreme value theory to hold for the largest k observations, for a given and fixed k. Small-sample simulations show that these “fixed-k” intervals have excellent small-sample coverage properties, and we illustrate their use with mainland U.S. hurricane data. In addition, we provide an analytical result about the additional asymptotic robustness of the fixed-k approach compared to k_n → ∞ inference. Journal: Journal of the American Statistical Association Pages: 1334-1343 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1215990 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1215990 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1334-1343 Template-Type: ReDIF-Article 1.0 Author-Name: Xuan Bi Author-X-Name-First: Xuan Author-X-Name-Last: Bi Author-Name: Annie Qu Author-X-Name-First: Annie Author-X-Name-Last: Qu Author-Name: Junhui Wang Author-X-Name-First: Junhui Author-X-Name-Last: Wang Author-Name: Xiaotong Shen Author-X-Name-First: Xiaotong Author-X-Name-Last: Shen Title: A Group-Specific Recommender System Abstract: In recent years, there has been a growing demand to develop efficient recommender systems which track users’ preferences and recommend potential items of interest to users. In this article, we propose a group-specific method to use dependency information from users and items which share similar characteristics under the singular value decomposition framework. The new approach is effective for the “cold-start” problem, where, in the testing set, the majority of responses are obtained from new users or for new items, and their preference information is not available from the training set. One advantage of the proposed model is that we are able to incorporate information from the missing mechanism and group-specific features through clustering based on the numbers of ratings from each user and other variables associated with missing patterns. In addition, since this type of data involves large-scale customer records, traditional algorithms are not computationally scalable. To implement the proposed method, we propose a new algorithm that embeds a back-fitting algorithm into alternating least squares, which avoids operations on large matrices and large memory storage, and therefore makes it feasible to achieve scalable computing. Our simulation studies and MovieLens data analysis both indicate that the proposed group-specific method improves prediction accuracy significantly compared to existing competitive recommender system approaches. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1344-1353 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1219261 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1219261 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1344-1353 Template-Type: ReDIF-Article 1.0 Author-Name: Mike G. Tsionas Author-X-Name-First: Mike G.
Author-X-Name-Last: Tsionas Title: “When, Where, and How” of Efficiency Estimation: Improved Procedures for Stochastic Frontier Modeling Abstract: The issues of functional form, distributions of the error components, and endogeneity are for the most part still open in stochastic frontier models. The same is true when it comes to imposition of restrictions of monotonicity and curvature, making efficiency estimation an elusive goal. In this article, we attempt to consider these problems simultaneously and offer practical solutions to the problems raised by Stone and addressed by Badunenko, Henderson and Kumbhakar. We provide major extensions to smoothly mixing regressions and fractional polynomial approximations for both the functional form of the frontier and the structure of inefficiency. Endogeneity is handled, simultaneously, using copulas. We provide detailed computational experiments and an application to U.S. banks. To explore the posteriors of the new models we rely heavily on sequential Monte Carlo techniques. Journal: Journal of the American Statistical Association Pages: 948-965 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1246364 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1246364 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:948-965 Template-Type: ReDIF-Article 1.0 Author-Name: Zihuai He Author-X-Name-First: Zihuai Author-X-Name-Last: He Author-Name: Min Zhang Author-X-Name-First: Min Author-X-Name-Last: Zhang Author-Name: Seunggeun Lee Author-X-Name-First: Seunggeun Author-X-Name-Last: Lee Author-Name: Jennifer A. Smith Author-X-Name-First: Jennifer A. Author-X-Name-Last: Smith Author-Name: Sharon L. R. Kardia Author-X-Name-First: Sharon L. R. Author-X-Name-Last: Kardia Author-Name: V. Diez Roux Author-X-Name-First: V. Diez Author-X-Name-Last: Roux Author-Name: Bhramar Mukherjee Author-X-Name-First: Bhramar Author-X-Name-Last: Mukherjee Title: Set-Based Tests for the Gene–Environment Interaction in Longitudinal Studies Abstract: We propose a generalized score type test for set-based inference for the gene–environment interaction with longitudinally measured quantitative traits. The test is robust to misspecification of within subject correlation structure and has enhanced power compared to existing alternatives. Unlike tests for marginal genetic association, set-based tests for the gene–environment interaction face the challenges of a potentially misspecified and high-dimensional main effect model under the null hypothesis. We show that our proposed test is robust to main effect misspecification of environmental exposure and genetic factors under the gene–environment independence condition. When genetic and environmental factors are dependent, the method of sieves is further proposed to eliminate potential bias due to a misspecified main effect of a continuous environmental exposure. A weighted principal component analysis approach is developed to perform dimension reduction when the number of genetic variants in the set is large relative to the sample size. The methods are motivated by an example from the Multi-Ethnic Study of Atherosclerosis (MESA), investigating interaction between measures of neighborhood environment and genetic regions on longitudinal measures of blood pressure over a study period of about seven years with four exams. Supplementary materials for this article are available online. 
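To fix ideas about set-based score-type tests for gene–environment interaction like the one described above, here is a deliberately simplified cross-sectional sketch. It assumes independent observations and a least-squares null fit, aggregates the interaction score into a quadratic statistic, and calibrates it by permuting residuals; it is not the authors' longitudinal generalized score test, and all names are hypothetical:

```python
import numpy as np

def gxe_score_test(y, E, G, X, n_perm=1000, seed=0):
    """Generic set-based score-type test for gene-environment interaction.
    y: (n,) quantitative trait; E: (n,) environmental exposure;
    G: (n, q) genotype matrix for the variant set; X: (n, p) covariates.
    Fits the null (main-effects-only) model, then aggregates the score of
    the G*E interaction columns into a quadratic statistic whose null
    distribution is approximated by permuting residuals."""
    rng = np.random.default_rng(seed)
    Z = np.column_stack([np.ones(len(y)), X, E, G])  # null design
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta                                 # null residuals
    S = G * E[:, None]                               # interaction columns
    q_obs = np.sum((S.T @ r) ** 2)                   # quadratic score statistic
    q_perm = np.array([np.sum((S.T @ rng.permutation(r)) ** 2)
                       for _ in range(n_perm)])
    return (1 + np.sum(q_perm >= q_obs)) / (n_perm + 1)
```

Aggregating the per-variant scores quadratically is what makes the test "set-based"; the longitudinal setting of the article would replace the least-squares fit and residual permutation with estimating equations that account for within-subject correlation.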
Journal: Journal of the American Statistical Association Pages: 966-978 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1252266 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1252266 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:966-978 Template-Type: ReDIF-Article 1.0 Author-Name: Ethan X. Fang Author-X-Name-First: Ethan X. Author-X-Name-Last: Fang Author-Name: Min-Dian Li Author-X-Name-First: Min-Dian Author-X-Name-Last: Li Author-Name: Michael I. Jordan Author-X-Name-First: Michael I. Author-X-Name-Last: Jordan Author-Name: Han Liu Author-X-Name-First: Han Author-X-Name-Last: Liu Title: Mining Massive Amounts of Genomic Data: A Semiparametric Topic Modeling Approach Abstract: Characterizing the functional relevance of transcription factors (TFs) in different biological contexts is pivotal in systems biology. Given the massive amount of genomic data, computational identification of TFs is emerging as a useful approach to bridge functional genomics with disease risk loci. In this article, we use large-scale gene expression and chromatin immunoprecipitation (ChIP) data corpora to conduct high-throughput TF-biological context association analysis. This work makes two contributions: (i) From a methodological perspective, we propose a unified topic modeling framework for exploring and analyzing large and complex genomic datasets. Under this framework, we develop new statistical optimization algorithms and semiparametric theoretical analysis, which are also applicable to a variety of large-scale data analyses. (ii) From an experimental perspective, our method generates an informative list of tumor-related TFs and the tumor types they may affect. Our data-driven analysis of 38 TFs in 68 tumor biological contexts identifies functional signatures of epigenetic regulators, such as SUZ12 and SETDB1, and nuclear receptors, in many tumor types. In particular, the TF signature of SUZ12 is present in a broad range of tumor types, many of which have not been reported before. In summary, our work establishes a robust method to identify the association between TFs and biological contexts. Given the limited number of genome-wide binding profiles of TFs and the massive number of expression profiles, our work provides a useful tool to deconvolute the gene regulatory network for tumors and other biological contexts. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 921-932 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1256812 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1256812 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:921-932 Template-Type: ReDIF-Article 1.0 Author-Name: Weiyi Xie Author-X-Name-First: Weiyi Author-X-Name-Last: Xie Author-Name: Sebastian Kurtek Author-X-Name-First: Sebastian Author-X-Name-Last: Kurtek Author-Name: Karthik Bharath Author-X-Name-First: Karthik Author-X-Name-Last: Bharath Author-Name: Ying Sun Author-X-Name-First: Ying Author-X-Name-Last: Sun Title: A Geometric Approach to Visualization of Variability in Functional Data Abstract: We propose a new method for the construction and visualization of boxplot-type displays for functional data.
We use a recent functional data analysis framework, based on a representation of functions called square-root slope functions, to decompose observed variation in functional data into three main components: amplitude, phase, and vertical translation. We then construct separate displays for each component, using the geometry and metric of each representation space, based on a novel definition of the median, the two quartiles, and extreme observations. The outlyingness of functional data is a very complex concept. Thus, we propose to identify outliers based on any of the three main components after decomposition. We provide a variety of visualization tools for the proposed boxplot-type displays including surface plots. We evaluate the proposed method using extensive simulations and then focus our attention on three real data applications including exploratory data analysis of sea surface temperature functions, electrocardiogram functions, and growth curves. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 979-993 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1256813 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1256813 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:979-993 Template-Type: ReDIF-Article 1.0 Author-Name: Jingfei Zhang Author-X-Name-First: Jingfei Author-X-Name-Last: Zhang Author-Name: Jiguo Cao Author-X-Name-First: Jiguo Author-X-Name-Last: Cao Title: Finding Common Modules in a Time-Varying Network with Application to the Gene Regulation Network Abstract: Finding functional modules in gene regulation networks is an important task in systems biology. Many methods have been proposed for finding communities in static networks; however, the application of such methods is limited due to the dynamic nature of gene regulation networks. In this article, we first propose a statistical framework for detecting common modules in the Drosophila melanogaster time-varying gene regulation network. We then develop both a significance test and a robustness test for the identified modular structure. We apply an enrichment analysis to our community findings, which reveals interesting results. Moreover, we investigate the consistency property of our proposed method under a time-varying stochastic block model framework with a temporal correlation structure. Although we focus on gene regulation networks in our work, our method is general and can be applied to other time-varying networks. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 994-1008 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1260465 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1260465 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:994-1008 Template-Type: ReDIF-Article 1.0 Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Author-Name: Dan Shen Author-X-Name-First: Dan Author-X-Name-Last: Shen Author-Name: Xuewei Peng Author-X-Name-First: Xuewei Author-X-Name-Last: Peng Author-Name: Leo Yufeng Liu Author-X-Name-First: Leo Yufeng Author-X-Name-Last: Liu Title: MWPCR: Multiscale Weighted Principal Component Regression for High-Dimensional Prediction Abstract: We propose a multiscale weighted principal component regression (MWPCR) framework for the use of high-dimensional features with strong spatial structure (e.g., smoothness and correlation) to predict an outcome variable, such as disease status. This development is motivated by identifying imaging biomarkers that could potentially aid detection, diagnosis, assessment of prognosis, prediction of response to treatment, and monitoring of disease status, among many others. The MWPCR can be regarded as a novel integration of principal components analysis (PCA), kernel methods, and regression models. In MWPCR, we introduce various weight matrices to prewhiten high-dimensional feature vectors, perform matrix decomposition for both dimension reduction and feature extraction, and build a prediction model by using the extracted features. Examples of such weight matrices include an importance score weight matrix for the selection of individual features at each location and a spatial weight matrix for the incorporation of the spatial pattern of feature vectors. We integrate the importance score weights with the spatial weights to recover the low-dimensional structure of high-dimensional features. We demonstrate the utility of our methods through extensive simulations and real data analyses of the Alzheimer’s disease neuroimaging initiative (ADNI) dataset. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1009-1021 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1261710 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1261710 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1009-1021 Template-Type: ReDIF-Article 1.0 Author-Name: Tao Wang Author-X-Name-First: Tao Author-X-Name-Last: Wang Author-Name: Hongyu Zhao Author-X-Name-First: Hongyu Author-X-Name-Last: Zhao Title: Constructing Predictive Microbial Signatures at Multiple Taxonomic Levels Abstract: Recent advances in DNA sequencing technology have enabled rapid advances in our understanding of the contribution of the human microbiome to many aspects of normal human physiology and disease. A major goal of human microbiome studies is the identification of important groups of microbes that are predictive of host phenotypes. However, the large number of bacterial taxa and the compositional nature of the data make this goal difficult to achieve using traditional approaches. Furthermore, the microbiome data are structured in the sense that bacterial taxa are not independent of one another and are related evolutionarily by a phylogenetic tree. To deal with these challenges, we introduce the concept of variable fusion for high-dimensional compositional data and propose a novel tree-guided variable fusion method. Our method is based on the linear regression model with tree-guided penalty functions.
It incorporates the tree information node-by-node and is capable of building predictive models composed of bacterial taxa at different taxonomic levels. A gut microbiome data analysis and simulations are presented to illustrate the good performance of the proposed method. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1022-1031 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1270213 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1270213 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1022-1031 Template-Type: ReDIF-Article 1.0 Author-Name: Sihai Dave Zhao Author-X-Name-First: Sihai Dave Author-X-Name-Last: Zhao Author-Name: T. Tony Cai Author-X-Name-First: T. Tony Author-X-Name-Last: Cai Author-Name: Thomas P. Cappola Author-X-Name-First: Thomas P. Author-X-Name-Last: Cappola Author-Name: Kenneth B. Margulies Author-X-Name-First: Kenneth B. Author-X-Name-Last: Margulies Author-Name: Hongzhe Li Author-X-Name-First: Hongzhe Author-X-Name-Last: Li Title: Sparse Simultaneous Signal Detection for Identifying Genetically Controlled Disease Genes Abstract: Genome-wide association studies (GWAS) and differential expression analyses have had limited success in finding genes that cause complex diseases such as heart failure (HF), a leading cause of death in the United States. This article proposes a new statistical approach that integrates GWAS and expression quantitative trait loci (eQTL) data to identify important HF genes. For such genes, genetic variations that perturb their expression are also likely to influence disease risk. The proposed method thus tests for the presence of simultaneous signals: SNPs that are associated with the gene’s expression as well as with disease. An analytic expression for the p-value is obtained, and the method is shown to be asymptotically adaptively optimal under certain conditions. It also allows the GWAS and eQTL data to be collected from different groups of subjects, enabling investigators to integrate public resources with their own data. Simulation experiments show that it can be more powerful than standard approaches and also robust to linkage disequilibrium between variants. The method is applied to an extensive analysis of HF genomics and identifies several genes with biological evidence for being functionally relevant in the etiology of HF. It is implemented in the R package ssa. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1032-1046 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1270825 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1270825 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1032-1046 Template-Type: ReDIF-Article 1.0 Author-Name: E. I. George Author-X-Name-First: E. I. Author-X-Name-Last: George Author-Name: V. Ročková Author-X-Name-First: V. Author-X-Name-Last: Ročková Author-Name: P. R. Rosenbaum Author-X-Name-First: P. R. Author-X-Name-Last: Rosenbaum Author-Name: V. A. Satopää Author-X-Name-First: V. A. Author-X-Name-Last: Satopää Author-Name: J. H. Silber Author-X-Name-First: J. H.
Author-X-Name-Last: Silber Title: Mortality Rate Estimation and Standardization for Public Reporting: Medicare’s Hospital Compare Abstract: Bayesian models are increasingly fit to large administrative datasets and then used to make individualized recommendations. In particular, Medicare’s Hospital Compare webpage provides information to patients about specific hospital mortality rates for a heart attack or acute myocardial infarction (AMI). Hospital Compare’s current recommendations are based on a random-effects logit model with a random hospital indicator and patient risk factors. Except for the largest hospitals, these individual recommendations or predictions are not checkable against data, because data from smaller hospitals are too limited to provide a meaningful check. Before individualized Bayesian recommendations, people derived general advice from empirical studies of many hospitals, for example, prefer hospitals of Type 1 to Type 2 because the risk is lower at Type 1 hospitals. Here, we calibrate these Bayesian recommendation systems by checking, out of sample, whether their predictions aggregate to give correct general advice derived from another sample. This process of calibrating individualized predictions against general empirical advice leads to substantial revisions in the Hospital Compare model for AMI mortality. To make appropriately calibrated predictions, our revised models incorporate information about hospital volume, nursing staff, medical residents, and the hospital’s ability to perform cardiovascular procedures. For the ultimate purpose of comparisons, hospital mortality rates must be standardized to adjust for patient mix variation across hospitals. We find that indirect standardization, as currently used by Hospital Compare, fails to adequately control for differences in patient risk factors and systematically underestimates mortality rates at the low volume hospitals. To provide good control and correctly calibrated rates, we propose direct standardization instead. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 933-947 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1276021 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1276021 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:933-947 Template-Type: ReDIF-Article 1.0 Author-Name: Wesley Tansey Author-X-Name-First: Wesley Author-X-Name-Last: Tansey Author-Name: Alex Athey Author-X-Name-First: Alex Author-X-Name-Last: Athey Author-Name: Alex Reinhart Author-X-Name-First: Alex Author-X-Name-Last: Reinhart Author-Name: James G. Scott Author-X-Name-First: James G. Author-X-Name-Last: Scott Title: Multiscale Spatial Density Smoothing: An Application to Large-Scale Radiological Survey and Anomaly Detection Abstract: We consider the problem of estimating a spatially varying density function, motivated by problems that arise in large-scale radiological survey and anomaly detection. In this context, the density functions to be estimated are the background gamma-ray energy spectra at sites spread across a large geographical area, such as nuclear production and waste-storage sites, military bases, medical facilities, university campuses, or the downtown of a city. Several challenges combine to make this a difficult problem. First, the spectral density at any given spatial location may have both smooth and nonsmooth features. 
Second, the spatial correlation in these density functions is neither stationary nor locally isotropic. Finally, at some spatial locations, there are very few data. We present a method called multiscale spatial density smoothing that successfully addresses these challenges. The method is based on a recursive dyadic partition of the sample space, and therefore shares much in common with other multiscale methods, such as wavelets and Pólya-tree priors. We describe an efficient algorithm for finding a maximum a posteriori (MAP) estimate that leverages recent advances in convex optimization for nonsmooth functions. We apply multiscale spatial density smoothing to real data collected on the background gamma-ray spectra at locations across a large university campus. The method exhibits state-of-the-art performance for spatial smoothing in density estimation, and it leads to substantial improvements in power when used in conjunction with existing methods for detecting the kinds of radiological anomalies that may have important consequences for public health and safety. Journal: Journal of the American Statistical Association Pages: 1047-1063 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2016.1276461 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1276461 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1047-1063 Template-Type: ReDIF-Article 1.0 Author-Name: Eric D. Schoen Author-X-Name-First: Eric D. Author-X-Name-Last: Schoen Author-Name: Nha Vo-Thanh Author-X-Name-First: Nha Author-X-Name-Last: Vo-Thanh Author-Name: Peter Goos Author-X-Name-First: Peter Author-X-Name-Last: Goos Title: Two-Level Orthogonal Screening Designs With 24, 28, 32, and 36 Runs Abstract: The potential of two-level orthogonal designs to fit models with main effects and two-factor interaction effects is commonly assessed through the correlation between contrast vectors involving these effects. We study the complete catalog of nonisomorphic orthogonal two-level 24-run designs involving 3–23 factors and we identify the best few designs in terms of these correlations. By modifying an existing enumeration algorithm, we identify the best few 28-run designs involving 3–14 factors and the best few 36-run designs involving 3–18 factors as well. Based on a complete catalog of 7570 designs with 28 runs and 27 factors, we also seek good 28-run designs with more than 14 factors. Finally, starting from a unique 31-factor design in 32 runs that minimizes the maximum correlation among the contrast vectors for main effects and two-factor interactions, we obtain 32-run designs that have low values for this correlation. To demonstrate the added value of our work, we provide a detailed comparison of our designs to the alternatives available in the literature. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1354-1369 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2017.1279547 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1279547 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1354-1369 Template-Type: ReDIF-Article 1.0 Author-Name: Blakeley B. McShane Author-X-Name-First: Blakeley B.
Author-X-Name-Last: McShane Author-Name: David Gal Author-X-Name-First: David Author-X-Name-Last: Gal Title: Statistical Significance and the Dichotomization of Evidence Abstract: In light of recent concerns about reproducibility and replicability, the ASA issued a Statement on Statistical Significance and p-values aimed at those who are not primarily statisticians. While the ASA Statement notes that statistical significance and p-values are “commonly misused and misinterpreted,” it does not discuss and document broader implications of these errors for the interpretation of evidence. In this article, we review research on how applied researchers who are not primarily statisticians misuse and misinterpret p-values in practice and how this can lead to errors in the interpretation of evidence. We also present new data showing, perhaps surprisingly, that researchers who are primarily statisticians are also prone to misuse and misinterpret p-values thus resulting in similar errors. In particular, we show that statisticians tend to interpret evidence dichotomously based on whether or not a p-value crosses the conventional 0.05 threshold for statistical significance. We discuss implications and offer recommendations. Journal: Journal of the American Statistical Association Pages: 885-895 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2017.1289846 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1289846 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:885-895 Template-Type: ReDIF-Article 1.0 Author-Name: Alfredo Farjat Author-X-Name-First: Alfredo Author-X-Name-Last: Farjat Author-Name: Brian J. Reich Author-X-Name-First: Brian J. Author-X-Name-Last: Reich Author-Name: Joseph Guinness Author-X-Name-First: Joseph Author-X-Name-Last: Guinness Author-Name: Ross Whetten Author-X-Name-First: Ross Author-X-Name-Last: Whetten Author-Name: Steven McKeand Author-X-Name-First: Steven Author-X-Name-Last: McKeand Author-Name: Fikret Isik Author-X-Name-First: Fikret Author-X-Name-Last: Isik Title: Optimal Seed Deployment Under Climate Change Using Spatial Models: Application to Loblolly Pine in the Southeastern US Abstract: Provenance tests are a common tool in forestry designed to identify superior genotypes for planting at specific locations. The trials are replicated experiments established with seed from parent trees collected from different regions and grown at several locations. In this work, a Bayesian spatial approach is developed for modeling the expected relative performance of seed sources using climate variables as predictors associated with the origin of seed source and the planting site. The proposed modeling technique accounts for the spatial dependence in the data and introduces a separable Matérn covariance structure that provides a flexible means to estimate effects associated with the origin and planting site locations. The statistical model was used to develop a quantitative tool for seed deployment aimed to identify the location of superior performing seed sources that could be suitable for a specific planting site under a given climate scenario. Cross-validation results indicate that the proposed spatial models provide superior predictive ability compared to multiple linear regression methods in unobserved locations. 
The general trend of performance predictions based on future climate scenarios suggests an optimal assisted migration of loblolly pine seed sources from southern and warmer regions to northern and colder areas in the southern USA. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 909-920 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2017.1292179 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1292179 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:909-920 Template-Type: ReDIF-Article 1.0 Author-Name: Andrew Gelman Author-X-Name-First: Andrew Author-X-Name-Last: Gelman Author-Name: John Carlin Author-X-Name-First: John Author-X-Name-Last: Carlin Title: Some Natural Solutions to the p-Value Communication Problem—and Why They Won’t Work Journal: Journal of the American Statistical Association Pages: 899-901 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2017.1311263 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1311263 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:899-901 Template-Type: ReDIF-Article 1.0 Author-Name: William M. Briggs Author-X-Name-First: William M. Author-X-Name-Last: Briggs Title: The Substitute for p-Values Abstract: If it was not obvious before, after reading McShane and Gal, the conclusion is that p-values should be proscribed. There are no good uses for them; indeed, every use either violates frequentist theory, is fallacious, or is based on a misunderstanding. A replacement for p-values is suggested, based on predictive models. Journal: Journal of the American Statistical Association Pages: 897-898 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2017.1311264 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1311264 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:897-898 Template-Type: ReDIF-Article 1.0 Author-Name: Eric B. Laber Author-X-Name-First: Eric B. Author-X-Name-Last: Laber Author-Name: Kerby Shedden Author-X-Name-First: Kerby Author-X-Name-Last: Shedden Title: Statistical Significance and the Dichotomization of Evidence: The Relevance of the ASA Statement on Statistical Significance and p-Values for Statisticians Journal: Journal of the American Statistical Association Pages: 902-904 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2017.1311265 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1311265 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:902-904 Template-Type: ReDIF-Article 1.0 Author-Name: Donald Berry Author-X-Name-First: Donald Author-X-Name-Last: Berry Title: A p-Value to Die For Journal: Journal of the American Statistical Association Pages: 895-897 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2017.1316279 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1316279 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:895-897 Template-Type: ReDIF-Article 1.0 Author-Name: Blakeley B. McShane Author-X-Name-First: Blakeley B.
Author-X-Name-Last: McShane Author-Name: David Gal Author-X-Name-First: David Author-X-Name-Last: Gal Title: Rejoinder: Statistical Significance and the Dichotomization of Evidence Journal: Journal of the American Statistical Association Pages: 904-908 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2017.1323642 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1323642 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:904-908 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Book Reviews Journal: Journal of the American Statistical Association Pages: 1370-1379 Issue: 519 Volume: 112 Year: 2017 Month: 7 X-DOI: 10.1080/01621459.2017.1367179 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1367179 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1370-1379 Template-Type: ReDIF-Article 1.0 Author-Name: P. Richard Hahn Author-X-Name-First: P. Richard Author-X-Name-Last: Hahn Author-Name: Ryan Martin Author-X-Name-First: Ryan Author-X-Name-Last: Martin Author-Name: Stephen G. Walker Author-X-Name-First: Stephen G. Author-X-Name-Last: Walker Title: On Recursive Bayesian Predictive Distributions Abstract: A Bayesian framework is attractive in the context of prediction, but a fast recursive update of the predictive distribution has apparently been out of reach, in part because Monte Carlo methods are generally used to compute the predictive. This article shows that online Bayesian prediction is possible by characterizing the Bayesian predictive update in terms of a bivariate copula, making it unnecessary to pass through the posterior to update the predictive. In standard models, the Bayesian predictive update corresponds to familiar choices of copula but, in nonparametric problems, the appropriate copula may not have a closed-form expression. In such cases, our new perspective suggests a fast recursive approximation to the predictive density, in the spirit of Newton’s predictive recursion algorithm, but without requiring evaluation of normalizing constants. Consistency of the new algorithm is shown, and numerical examples demonstrate its quality performance in finite samples compared to fully Bayesian and kernel methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1085-1093 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1304219 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1304219 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1085-1093 Template-Type: ReDIF-Article 1.0 Author-Name: Audrey Boruvka Author-X-Name-First: Audrey Author-X-Name-Last: Boruvka Author-Name: Daniel Almirall Author-X-Name-First: Daniel Author-X-Name-Last: Almirall Author-Name: Katie Witkiewitz Author-X-Name-First: Katie Author-X-Name-Last: Witkiewitz Author-Name: Susan A. Murphy Author-X-Name-First: Susan A. Author-X-Name-Last: Murphy Title: Assessing Time-Varying Causal Effect Moderation in Mobile Health Abstract: In mobile health interventions aimed at behavior change and maintenance, treatments are provided in real time to manage current or impending high-risk situations or promote healthy behaviors in near real time.
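Newton's predictive recursion, invoked in the Hahn, Martin, and Walker abstract above, admits a compact grid-based sketch for a normal location mixture. The kernel, grid, and weight sequence below are illustrative assumptions; this shows the classic recursion rather than the authors' copula-based update:

```python
import numpy as np
from scipy.stats import norm

def predictive_recursion(xs, theta_grid):
    """Newton's predictive recursion for a N(theta, 1) location mixture:
    a single pass over the data recursively updates a mixing density on
    a grid, with no Monte Carlo and no posterior sampling."""
    f = np.ones_like(theta_grid) / np.ptp(theta_grid)  # flat initial guess
    dtheta = theta_grid[1] - theta_grid[0]
    for i, x in enumerate(xs, start=1):
        w = 1.0 / (i + 1)                          # decaying weight sequence
        lik = norm.pdf(x, loc=theta_grid, scale=1.0)
        post = lik * f
        post /= np.sum(post) * dtheta              # normalize on the grid
        f = (1 - w) * f + w * post                 # recursive mixture update
    def predictive(x):
        # Implied predictive density: mix the kernel over the fitted f.
        return np.sum(norm.pdf(x, theta_grid, 1.0) * f) * dtheta
    return f, predictive
```

Each observation is absorbed with one reweighting step, which is what makes the update "online"; the article's contribution is a copula characterization that extends this kind of recursion to the predictive itself.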
Currently there is great scientific interest in developing data analysis approaches to guide the development of mobile interventions. In particular data from mobile health studies might be used to examine effect moderators—individual characteristics, time-varying context, or past treatment response that moderate the effect of current treatment on a subsequent response. This article introduces a formal definition for moderated effects in terms of potential outcomes, a definition that is particularly suited to mobile interventions, where treatment occasions are numerous, individuals are not always available for treatment, and potential moderators might be influenced by past treatment. Methods for estimating moderated effects are developed and compared. The proposed approach is illustrated using BASICS-Mobile, a smartphone-based intervention designed to curb heavy drinking and smoking among college students. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1112-1121 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1305274 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1305274 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1112-1121 Template-Type: ReDIF-Article 1.0 Author-Name: Ashkan Ertefaie Author-X-Name-First: Ashkan Author-X-Name-Last: Ertefaie Author-Name: Dylan S. Small Author-X-Name-First: Dylan S. Author-X-Name-Last: Small Author-Name: Paul R. Rosenbaum Author-X-Name-First: Paul R. Author-X-Name-Last: Rosenbaum Title: Quantitative Evaluation of the Trade-Off of Strengthened Instruments and Sample Size in Observational Studies Abstract: Weak instruments produce causal inferences that are sensitive to small failures of the assumptions underlying an instrumental variable, so strong instruments are preferred. The possibility of strengthening an instrument at the price of a reduced sample size has been proposed in the statistical literature and used in the medical literature, but there has not been a theoretical study of the trade-off of instrument strength and sample size. This trade-off and related questions are examined using the Bahadur efficiency of a test or a sensitivity analysis. A moderate increase in instrument strength is worth more than an enormous increase in sample size. This is true with a flawless instrument, and the difference is more pronounced when allowance is made for small unmeasured biases in the instrument. A new method of strengthening an instrument is proposed: it discards half the sample to learn empirically where the instrument is strong, then discards part of the remaining half to avoid areas where the instrument is weak; however, the gains in instrument strength can more than compensate for the loss of sample size. The example is drawn from a study of the effectiveness of high-level neonatal intensive care units in saving the lives of premature infants. Journal: Journal of the American Statistical Association Pages: 1122-1134 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1305275 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1305275 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1122-1134 Template-Type: ReDIF-Article 1.0 Author-Name: Chao Wang Author-X-Name-First: Chao Author-X-Name-Last: Wang Author-Name: Kung-Sik Chan Author-X-Name-First: Kung-Sik Author-X-Name-Last: Chan Title: Quasi-Likelihood Estimation of a Censored Autoregressive Model With Exogenous Variables Abstract: Maximum likelihood estimation of a censored autoregressive model with exogenous variables (CARX) requires computing the conditional likelihood of blocks of data of variable dimensions. As the random block dimension generally increases with the censoring rate, maximum likelihood estimation quickly becomes numerically intractable with increasing censoring. We introduce a new estimation approach using the complete-incomplete data framework, with the complete data comprising the observations that would be available were there no censoring. We construct a system of unbiased estimating equations motivated by the complete-data score vector for estimating a CARX model. The proposed quasi-likelihood method reduces to maximum likelihood estimation when there is no censoring, and it is computationally efficient. We derive the consistency and asymptotic normality of the quasi-likelihood estimator, under mild regularity conditions. We illustrate the efficacy of the proposed method by simulations and a real application to phosphorus concentration in river water. Journal: Journal of the American Statistical Association Pages: 1135-1145 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1307115 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1307115 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1135-1145 Template-Type: ReDIF-Article 1.0 Author-Name: Jing Lei Author-X-Name-First: Jing Author-X-Name-Last: Lei Author-Name: Max G’Sell Author-X-Name-First: Max Author-X-Name-Last: G’Sell Author-Name: Alessandro Rinaldo Author-X-Name-First: Alessandro Author-X-Name-Last: Rinaldo Author-Name: Ryan J. Tibshirani Author-X-Name-First: Ryan J. Author-X-Name-Last: Tibshirani Author-Name: Larry Wasserman Author-X-Name-First: Larry Author-X-Name-Last: Wasserman Title: Distribution-Free Predictive Inference for Regression Abstract: We develop a general framework for distribution-free predictive inference in regression, using conformal inference. The proposed methodology allows for the construction of a prediction band for the response variable using any estimator of the regression function. The resulting prediction band preserves the consistency properties of the original estimator under standard assumptions, while guaranteeing finite-sample marginal coverage even when these assumptions do not hold. We analyze and compare, both empirically and theoretically, the two major variants of our conformal framework: full conformal inference and split conformal inference, along with a related jackknife method. These methods offer different tradeoffs between statistical accuracy (length of resulting prediction intervals) and computational efficiency. As extensions, we develop a method for constructing valid in-sample prediction intervals called rank-one-out conformal inference, which has essentially the same computational efficiency as split conformal inference. We also describe an extension of our procedures for producing prediction bands with locally varying length, to adapt to heteroscedasticity in the data.
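For concreteness, here is a minimal sketch of the split conformal variant just described, assuming a generic black-box fitter in which fit(X, y) returns a prediction function; the holdout-quantile rule follows the standard split conformal recipe:

```python
import numpy as np

def split_conformal(fit, X, y, X_test, alpha=0.1, seed=0):
    """Split conformal prediction band: fit on one half of the data,
    rank absolute residuals on the other half, and widen test
    predictions by their (1 - alpha) holdout quantile."""
    rng = np.random.default_rng(seed)
    n = len(y)
    idx = rng.permutation(n)
    train, calib = idx[: n // 2], idx[n // 2 :]
    predict = fit(X[train], y[train])                 # any regression fitter
    resid = np.sort(np.abs(y[calib] - predict(X[calib])))
    k = int(np.ceil((len(calib) + 1) * (1 - alpha))) - 1
    d = resid[min(k, len(resid) - 1)]                 # conformal half-width
    mu = predict(X_test)
    return mu - d, mu + d                             # lower and upper bands
```

Any regression estimator can be plugged in as fit, which is the point of the distribution-free guarantee: the finite-sample marginal coverage comes from the exchangeability of the holdout residuals, not from the fitted model being correct.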
Finally, we propose a model-free notion of variable importance, called leave-one-covariate-out or LOCO inference. Accompanying this article is an R package conformalInference that implements all of the proposals we have introduced. In the spirit of reproducibility, all of our empirical results can also be easily (re)generated using this package. Journal: Journal of the American Statistical Association Pages: 1094-1111 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1307116 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1307116 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1094-1111 Template-Type: ReDIF-Article 1.0 Author-Name: Hao Chen Author-X-Name-First: Hao Author-X-Name-Last: Chen Author-Name: Xu Chen Author-X-Name-First: Xu Author-X-Name-Last: Chen Author-Name: Yi Su Author-X-Name-First: Yi Author-X-Name-Last: Su Title: A Weighted Edge-Count Two-Sample Test for Multivariate and Object Data Abstract: Two-sample tests for multivariate data and non-Euclidean data are widely used in many fields. Parametric tests are mostly restricted to certain types of data that meet the assumptions of the parametric models. In this article, we study a nonparametric testing procedure that uses graphs representing the similarity among observations. It can be applied to any data type as long as an informative similarity measure on the sample space can be defined. The classic test based on a similarity graph has a problem when the two sample sizes are different. We solve the problem by applying appropriate weights to different components of the classic test statistic. The new test exhibits substantial power gains in simulation studies. Its asymptotic permutation null distribution is derived and shown to work well under finite samples, facilitating its application to large datasets. The new test is illustrated through an analysis on a real dataset of network data. Journal: Journal of the American Statistical Association Pages: 1146-1155 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1307757 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1307757 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1146-1155 Template-Type: ReDIF-Article 1.0 Author-Name: Wesley Tansey Author-X-Name-First: Wesley Author-X-Name-Last: Tansey Author-Name: Oluwasanmi Koyejo Author-X-Name-First: Oluwasanmi Author-X-Name-Last: Koyejo Author-Name: Russell A. Poldrack Author-X-Name-First: Russell A. Author-X-Name-Last: Poldrack Author-Name: James G. Scott Author-X-Name-First: James G. Author-X-Name-Last: Scott Title: False Discovery Rate Smoothing Abstract: We present false discovery rate (FDR) smoothing, an empirical-Bayes method for exploiting spatial structure in large multiple-testing problems. FDR smoothing automatically finds spatially localized regions of significant test statistics. It then relaxes the threshold of statistical significance within these regions, and tightens it elsewhere, in a manner that controls the overall false discovery rate at a given level. This results in increased power and cleaner spatial separation of signals from noise. The approach requires solving a nonstandard high-dimensional optimization problem, for which an efficient augmented-Lagrangian algorithm is presented.
In simulation studies, FDR smoothing exhibits state-of-the-art performance at modest computational cost. In particular, it is shown to be far more robust than existing methods for spatially dependent multiple testing. We also apply the method to a dataset from an fMRI experiment on spatial working memory, where it detects patterns that are much more biologically plausible than those detected by standard FDR-controlling methods. All code for FDR smoothing is publicly available in Python and R (https://github.com/tansey/smoothfdr). Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1156-1171 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1319838 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1319838 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1156-1171 Template-Type: ReDIF-Article 1.0 Author-Name: Stefan Wager Author-X-Name-First: Stefan Author-X-Name-Last: Wager Author-Name: Susan Athey Author-X-Name-First: Susan Author-X-Name-Last: Athey Title: Estimation and Inference of Heterogeneous Treatment Effects using Random Forests Abstract: Many scientific and engineering challenges—ranging from personalized medicine to customized marketing recommendations—require an understanding of treatment effect heterogeneity. In this article, we develop a nonparametric causal forest for estimating heterogeneous treatment effects that extends Breiman’s widely used random forest algorithm. In the potential outcomes framework with unconfoundedness, we show that causal forests are pointwise consistent for the true treatment effect and have an asymptotically Gaussian and centered sampling distribution. We also discuss a practical method for constructing asymptotic confidence intervals for the true treatment effect that are centered at the causal forest estimates. Our theoretical results rely on a generic Gaussian theory for a large family of random forest algorithms. To our knowledge, this is the first set of results that allows any type of random forest, including classification and regression forests, to be used for provably valid statistical inference. In experiments, we find causal forests to be substantially more powerful than classical methods based on nearest-neighbor matching, especially in the presence of irrelevant covariates. Journal: Journal of the American Statistical Association Pages: 1228-1242 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1319839 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1319839 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1228-1242 Template-Type: ReDIF-Article 1.0 Author-Name: Sokbae Lee Author-X-Name-First: Sokbae Author-X-Name-Last: Lee Author-Name: Yuan Liao Author-X-Name-First: Yuan Author-X-Name-Last: Liao Author-Name: Myung Hwan Seo Author-X-Name-First: Myung Hwan Author-X-Name-Last: Seo Author-Name: Youngki Shin Author-X-Name-First: Youngki Author-X-Name-Last: Shin Title: Oracle Estimation of a Change Point in High-Dimensional Quantile Regression Abstract: In this article, we consider a high-dimensional quantile regression model where the sparsity structure may differ between two sub-populations. We develop ℓ1-penalized estimators of both regression coefficients and the threshold parameter. 
Our penalized estimators not only select covariates but also discriminate between a model with homogeneous sparsity and a model with a change point. As a result, it is not necessary to know or pretest whether the change point is present, or where it occurs. Our estimator of the change point achieves an oracle property in the sense that its asymptotic distribution is the same as if the unknown active sets of regression coefficients were known. Importantly, we establish this oracle property without perfect covariate selection, thereby avoiding the need for the minimum level condition on the signals of active covariates. Dealing with high-dimensional quantile regression with an unknown change point calls for a new proof technique, since the quantile loss function is nonsmooth and, furthermore, the corresponding objective function is nonconvex with respect to the change point. The technique developed in this article is applicable to a general M-estimation framework with a change point, which may be of independent interest. The proposed methods are then illustrated via Monte Carlo experiments and an application to tipping in the dynamics of racial segregation. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1184-1194 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1319840 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1319840 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1184-1194 Template-Type: ReDIF-Article 1.0 Author-Name: Quentin Clairon Author-X-Name-First: Quentin Author-X-Name-Last: Clairon Author-Name: Nicolas J.-B. Brunel Author-X-Name-First: Nicolas J.-B. Author-X-Name-Last: Brunel Title: Optimal Control and Additive Perturbations Help in Estimating Ill-Posed and Uncertain Dynamical Systems Abstract: Ordinary differential equations (ODE) are routinely calibrated on real data for estimating unknown parameters or for reverse-engineering. Nevertheless, standard statistical techniques can give disappointing results because of the complex relationship between parameters and states, which makes the corresponding estimation problem ill-posed. Moreover, ODE are mechanistic models that are prone to modeling errors, whose influences on inference are often neglected during statistical analysis. We propose a regularized estimation framework, called Tracking, which consists in adding a perturbation (an L2 function) to the original ODE. This perturbation facilitates data fitting and also represents possible model misspecifications, so that parameter estimation is done by solving a trade-off between data fidelity and model fidelity. We show that the underlying optimization problem is an optimal control problem that can be solved by the Pontryagin maximum principle for general nonlinear and partially observed ODE. The same methodology can be used for the joint estimation of finite and time-varying parameters. We show, in the case of a well-specified parametric model, that our estimator is consistent and reaches the root-n rate. In addition, numerical experiments considering various sources of model misspecifications show that Tracking still furnishes accurate estimates. Finally, we consider semiparametric estimation on both simulated data and on a real data example. Supplementary materials for this article are available online.
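[Editorial note: the data-fidelity/model-fidelity trade-off at the heart of the Tracking framework just described can be sketched very simply. The following is a minimal illustration, not the authors' Pontryagin-based implementation: the logistic model, the Euler discretization, and the penalty weight lam are all assumptions made for the example.]

```python
# A minimal sketch of the Tracking idea: calibrate an ODE parameter theta
# while allowing a penalized additive perturbation u(t) that absorbs model
# misspecification. Minimizes ||y - x||^2 + lam * ||u||^2 over (theta, u).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
t = np.linspace(0.0, 5.0, 51)
dt = t[1] - t[0]

def euler_path(theta, u, x0=0.1):
    """Integrate x' = theta*x*(1-x) + u(t) with forward Euler; u is per-step."""
    x = np.empty_like(t)
    x[0] = x0
    for k in range(len(t) - 1):
        x[k + 1] = x[k] + dt * (theta * x[k] * (1.0 - x[k]) + u[k])
    return x

# Synthetic data from a slightly misspecified system (an extra forcing term).
true_u = 0.05 * np.sin(t[:-1])
y = euler_path(1.5, true_u) + 0.02 * rng.normal(size=len(t))

def objective(params, lam=5.0):
    theta, u = params[0], params[1:]
    return np.sum((y - euler_path(theta, u)) ** 2) + lam * dt * np.sum(u ** 2)

res = minimize(objective, x0=np.concatenate([[1.0], np.zeros(len(t) - 1)]),
               method="L-BFGS-B")
print("estimated theta:", res.x[0])
```

Raising lam forces u toward zero (pure model fidelity); lowering it lets the perturbation absorb more of the discrepancy, which is exactly the trade-off the abstract describes.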
Journal: Journal of the American Statistical Association Pages: 1195-1209 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1319841 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1319841 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1195-1209 Template-Type: ReDIF-Article 1.0 Author-Name: José R. Berrendero Author-X-Name-First: José R. Author-X-Name-Last: Berrendero Author-Name: Antonio Cuevas Author-X-Name-First: Antonio Author-X-Name-Last: Cuevas Author-Name: José L. Torrecilla Author-X-Name-First: José L. Author-X-Name-Last: Torrecilla Title: On the Use of Reproducing Kernel Hilbert Spaces in Functional Classification Abstract: The Hájek–Feldman dichotomy establishes that two Gaussian measures are either mutually absolutely continuous (and hence there is a Radon–Nikodym density for each measure with respect to the other one) or mutually singular. Unlike the case of finite-dimensional Gaussian measures, there are nontrivial examples of both situations when dealing with Gaussian stochastic processes. This article provides: (a) Explicit expressions for the optimal (Bayes) rule and the minimal classification error probability in several relevant problems of supervised binary classification of mutually absolutely continuous Gaussian processes. The approach relies on some classical results in the theory of reproducing kernel Hilbert spaces (RKHS). (b) An interpretation, in terms of mutual singularity, for the so-called “near perfect classification” phenomenon. We show that the asymptotically optimal rule proposed in that literature can be identified with the sequence of optimal rules for an approximating sequence of classification problems in the absolutely continuous case. (c) As an application, we discuss a natural variable selection method, which essentially consists of reducing the original functional data X(t), t ∈ [0, 1] to a d-dimensional marginal (X(t1), …, X(td)), which is chosen to minimize the classification error of the corresponding Fisher’s linear rule. We give precise conditions under which this discrimination method achieves the minimal classification error of the original functional problem. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1210-1218 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1320287 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1320287 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1210-1218 Template-Type: ReDIF-Article 1.0 Author-Name: Eric B. Laber Author-X-Name-First: Eric B. Author-X-Name-Last: Laber Author-Name: Ana-Maria Staicu Author-X-Name-First: Ana-Maria Author-X-Name-Last: Staicu Title: Functional Feature Construction for Individualized Treatment Regimes Abstract: Evidence-based personalized medicine formalizes treatment selection as an individualized treatment regime that maps up-to-date patient information into the space of possible treatments. Available patient information may include static features such as race, gender, family history, genetic and genomic information, as well as longitudinal information including the emergence of comorbidities, waxing and waning of symptoms, side-effect burden, and adherence.
Dynamic information measured at multiple time points before treatment assignment should be included as input to the treatment regime. However, subject longitudinal measurements are typically sparse, irregularly spaced, noisy, and vary in number across subjects. Existing estimators for treatment regimes require that equal information be measured on each subject, and thus standard practice is to summarize longitudinal subject information into a scalar, ad hoc summary during data preprocessing. This reduction of the longitudinal information to a scalar feature precedes estimation of a treatment regime and is therefore not informed by subject outcomes, treatments, or covariates. Furthermore, we show that this reduction requires more stringent causal assumptions for consistent estimation than are necessary. We propose a data-driven method for constructing maximally prescriptive yet interpretable features that can be used with standard methods for estimating optimal treatment regimes. In our proposed framework, we treat the subject longitudinal information as a realization of a stochastic process observed with error at discrete time points. Functionals of this latent process are then combined with outcome models to estimate an optimal treatment regime. The proposed methodology requires weaker causal assumptions than Q-learning with an ad hoc scalar summary and is consistent for the optimal treatment regime. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1219-1227 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1321545 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1321545 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1219-1227 Template-Type: ReDIF-Article 1.0 Author-Name: Lorenzo Trapani Author-X-Name-First: Lorenzo Author-X-Name-Last: Trapani Title: A Randomized Sequential Procedure to Determine the Number of Factors Abstract: This article proposes a procedure to estimate the number of common factors k in a static approximate factor model. The building block of the analysis is the fact that the first k eigenvalues of the covariance matrix of the data diverge, while the others stay bounded. On these grounds, we propose a test for the null that the ith eigenvalue diverges, using a randomized test statistic based directly on the estimated eigenvalue. The test requires only minimal assumptions on the data, and no assumptions are required on the factors, loadings, or idiosyncratic errors. The randomized tests are then employed in a sequential procedure to determine k. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1341-1349 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1328359 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1328359 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1341-1349 Template-Type: ReDIF-Article 1.0 Author-Name: Matias D. Cattaneo Author-X-Name-First: Matias D. Author-X-Name-Last: Cattaneo Author-Name: Michael Jansson Author-X-Name-First: Michael Author-X-Name-Last: Jansson Author-Name: Whitney K. Newey Author-X-Name-First: Whitney K.
Author-X-Name-Last: Newey Title: Inference in Linear Regression Models with Many Covariates and Heteroscedasticity Abstract: The linear regression model is widely used in empirical work in economics, statistics, and many other disciplines. Researchers often include many covariates in their linear model specification in an attempt to control for confounders. We give inference methods that allow for many covariates and heteroscedasticity. Our results are obtained using high-dimensional approximations, where the number of included covariates is allowed to grow as fast as the sample size. We find that all of the usual versions of Eicker–White heteroscedasticity-consistent standard error estimators for linear models are inconsistent under these asymptotics. We then propose a new heteroscedasticity-consistent standard error formula that is fully automatic and robust to both (conditional) heteroscedasticity of unknown form and the inclusion of possibly many covariates. We apply our findings to three settings: parametric linear models with many covariates, linear panel models with many fixed effects, and semiparametric semi-linear models with many technical regressors. Simulation evidence consistent with our theoretical results is provided, and the proposed methods are also illustrated with an empirical application. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1350-1361 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1328360 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1328360 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1350-1361 Template-Type: ReDIF-Article 1.0 Author-Name: Quan Zhou Author-X-Name-First: Quan Author-X-Name-Last: Zhou Author-Name: Yongtao Guan Author-X-Name-First: Yongtao Author-X-Name-Last: Guan Title: On the Null Distribution of Bayes Factors in Linear Regression Abstract: We show that under the null, 2 log(Bayes factor) is asymptotically distributed as a weighted sum of chi-squared random variables with a shifted mean. This claim holds for Bayesian multiple linear regression with a family of conjugate priors, namely, the normal-inverse-gamma prior, the g-prior, and the normal prior. Our results have three immediate impacts. First, we can compute analytically a p-value associated with a Bayes factor without the need for permutation. We provide a software package that can evaluate the p-value associated with a Bayes factor efficiently and accurately. Second, the null distribution illuminates some intrinsic properties of the Bayes factor, namely, how the Bayes factor depends quantitatively on the prior, and the genesis of Bartlett’s paradox. Third, informed by the null distribution of the Bayes factor, we formulate a novel scaled Bayes factor that depends less on the prior and is immune to Bartlett’s paradox. When two tests have an identical p-value, the test with a larger power tends to have a larger scaled Bayes factor, a desirable property that is missing for the (unscaled) Bayes factor. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1362-1371 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1328361 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1328361 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
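[Editorial note: the null result in the Zhou and Guan abstract above translates directly into a p-value computation. The sketch below is illustrative: the weights and shift are placeholders, since in practice they are derived from the design matrix and the prior, and the Monte Carlo approach here stands in for whatever numerical method their software package actually uses.]

```python
# A minimal sketch, assuming the weights and shift of the asymptotic null
# distribution are available: approximate P(2*log(BF) >= observed) by
# simulating the shifted weighted sum of chi-squared(1) variables.
import numpy as np

rng = np.random.default_rng(2)

def bf_pvalue(two_log_bf, weights, shift, n_sim=200_000):
    """P(W >= observed) where W = shift + sum_i weights[i] * chi2_1 draws."""
    chi2 = rng.chisquare(df=1, size=(n_sim, len(weights)))
    null_draws = shift + chi2 @ np.asarray(weights)
    return np.mean(null_draws >= two_log_bf)

# Hypothetical weights/shift for a small regression; observed 2*log(BF) = 4.2.
print(bf_pvalue(4.2, weights=[0.9, 0.6, 0.3], shift=-1.0))
```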
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1362-1371 Template-Type: ReDIF-Article 1.0 Author-Name: Lan Wang Author-X-Name-First: Lan Author-X-Name-Last: Wang Author-Name: Yu Zhou Author-X-Name-First: Yu Author-X-Name-Last: Zhou Author-Name: Rui Song Author-X-Name-First: Rui Author-X-Name-Last: Song Author-Name: Ben Sherwood Author-X-Name-First: Ben Author-X-Name-Last: Sherwood Title: Quantile-Optimal Treatment Regimes Abstract: Finding the optimal treatment regime (or a series of sequential treatment regimes) based on individual characteristics has important applications in areas such as precision medicine, government policies, and active labor market interventions. In the current literature, the optimal treatment regime is usually defined as the one that maximizes the average benefit in the potential population. This article studies a general framework for estimating the quantile-optimal treatment regime, which is of importance in many real-world applications. Given a collection of treatment regimes, we consider robust estimation of the quantile-optimal treatment regime, which does not require the analyst to specify an outcome regression model. We propose an alternative formulation of the estimator as a solution of an optimization problem with an estimated nuisance parameter. This novel representation allows us to investigate the asymptotic theory of the estimated optimal treatment regime using empirical process techniques. We derive theory involving a nonstandard convergence rate and a nonnormal limiting distribution. The same nonstandard convergence rate would also occur if the mean optimality criterion were applied, but this has not been studied. Thus, our results fill an important theoretical gap for a general class of policy search methods in the literature. The article investigates both static and dynamic treatment regimes. In addition, doubly robust estimation and alternative optimality criteria, such as those based on Gini’s mean difference or weighted quantiles, are investigated. Numerical simulations demonstrate the performance of the proposed estimator. A data example from a trial in HIV+ patients is used to illustrate the application. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1243-1254 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1330204 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1330204 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1243-1254 Template-Type: ReDIF-Article 1.0 Author-Name: Pallavi Basu Author-X-Name-First: Pallavi Author-X-Name-Last: Basu Author-Name: T. Tony Cai Author-X-Name-First: T. Tony Author-X-Name-Last: Cai Author-Name: Kiranmoy Das Author-X-Name-First: Kiranmoy Author-X-Name-Last: Das Author-Name: Wenguang Sun Author-X-Name-First: Wenguang Author-X-Name-Last: Sun Title: Weighted False Discovery Rate Control in Large-Scale Multiple Testing Abstract: The use of weights provides an effective strategy to incorporate prior domain knowledge in large-scale inference. This article studies weighted multiple testing in a decision-theoretic framework. We develop oracle and data-driven procedures that aim to maximize the expected number of true positives subject to a constraint on the weighted false discovery rate. The asymptotic validity and optimality of the proposed methods are established.
The results demonstrate that incorporating informative domain knowledge enhances the interpretability of results and the precision of inference. Simulation studies show that the proposed method controls the error rate at the nominal level, and the gain in power over existing methods is substantial in many settings. An application to a genome-wide association study is discussed. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1172-1183 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1336443 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1336443 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1172-1183 Template-Type: ReDIF-Article 1.0 Author-Name: Thomas A. Murray Author-X-Name-First: Thomas A. Author-X-Name-Last: Murray Author-Name: Ying Yuan Author-X-Name-First: Ying Author-X-Name-Last: Yuan Author-Name: Peter F. Thall Author-X-Name-First: Peter F. Author-X-Name-Last: Thall Title: A Bayesian Machine Learning Approach for Optimizing Dynamic Treatment Regimes Abstract: Medical therapy often consists of multiple stages, with a treatment chosen by the physician at each stage based on the patient’s history of treatments and clinical outcomes. These decisions can be formalized as a dynamic treatment regime. This article describes a new approach for optimizing dynamic treatment regimes, which bridges the gap between Bayesian inference and existing approaches, such as Q-learning. The proposed approach fits a series of Bayesian regression models, one for each stage, in reverse sequential order. Each model uses as a response variable the remaining payoff assuming optimal actions are taken at subsequent stages, and as covariates the current history and relevant actions at that stage. The key difficulty is that the optimal decision rules at subsequent stages are unknown, and even if these decision rules were known, the relevant response variables may be counterfactual. However, posterior distributions can be derived from the previously fitted regression models for the optimal decision rules and the counterfactual response variables under a particular set of rules. The proposed approach averages over these posterior distributions when fitting each regression model. An efficient sampling algorithm for estimation is presented, along with simulation studies that compare the proposed approach with Q-learning. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1255-1267 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1340887 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1340887 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1255-1267 Template-Type: ReDIF-Article 1.0 Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Donggyu Kim Author-X-Name-First: Donggyu Author-X-Name-Last: Kim Title: Robust High-Dimensional Volatility Matrix Estimation for High-Frequency Factor Model Abstract: High-frequency financial data allow us to estimate large volatility matrices with a relatively short time horizon. Many novel statistical methods have been introduced to address large volatility matrix estimation problems from a high-dimensional Itô process with microstructural noise contamination.
Their asymptotic theories require sub-Gaussian or finite high-order moment assumptions for observed log-returns. These assumptions are at odds with the heavy-tail phenomenon that is pervasive in financial stock returns, and new procedures are needed to mitigate the influence of heavy tails. In this article, we introduce the Huber loss function with a diverging threshold to develop a robust realized volatility estimator. We show that it has sub-Gaussian concentration around the volatility with only finite fourth moments of observed log-returns. With the proposed robust estimator as input, we further regularize it by using the principal orthogonal component thresholding (POET) procedure to estimate the large volatility matrix that admits an approximate factor structure. We establish the asymptotic theories for such low-rank plus sparse matrices. A simulation study is conducted to check the finite-sample performance of the proposed estimation methods. Journal: Journal of the American Statistical Association Pages: 1268-1283 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1340888 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1340888 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1268-1283 Template-Type: ReDIF-Article 1.0 Author-Name: Gongjun Xu Author-X-Name-First: Gongjun Author-X-Name-Last: Xu Author-Name: Zhuoran Shang Author-X-Name-First: Zhuoran Author-X-Name-Last: Shang Title: Identifying Latent Structures in Restricted Latent Class Models Abstract: This article focuses on a family of restricted latent structure models with wide applications in psychological and educational assessment, where the model parameters are restricted via a latent structure matrix to reflect prespecified assumptions on the latent attributes. Such a latent matrix is often provided by experts and assumed to be correct upon construction, yet it may be subjective and misspecified. Recognizing this problem, researchers have been developing methods to estimate the matrix from data. However, the fundamental issue of the identifiability of the latent structure matrix has not been addressed until now. The first goal of this article is to establish identifiability conditions that ensure the estimability of the structure matrix. With the theoretical development, the second part of the article proposes a likelihood-based method to estimate the latent structure from the data. Simulation studies show that the proposed method outperforms the existing approaches. We further illustrate the method through a dataset in educational assessment. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1284-1295 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1340889 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1340889 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
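[Editorial note: the Huber-loss idea in the Fan and Kim abstract above is easy to illustrate in the simplest setting, robust estimation of a mean from heavy-tailed data. The sketch below is not the paper's estimator: the threshold rate and the Student-t data are assumptions chosen for the example.]

```python
# A minimal sketch of Huber-loss mean estimation with a diverging threshold:
# solve the first-order condition sum_i psi_tau(x_i - m) = 0, where psi_tau
# is the clipped identity. The tau rate below is an assumed illustration.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(3)

def huber_mean(x, tau):
    """Root of the monotone estimating equation sum(clip(x - m, -tau, tau))."""
    psi = lambda m: np.sum(np.clip(x - m, -tau, tau))
    return brentq(psi, x.min(), x.max())

# Heavy-tailed sample: Student-t with 4.5 df (fourth moment barely finite).
x = rng.standard_t(df=4.5, size=5_000)
tau = np.std(x) * (len(x) / np.log(len(x))) ** 0.25   # diverging threshold
print("huber mean:", huber_mean(x, tau), " sample mean:", x.mean())
```

Letting tau grow with n keeps the estimator asymptotically unbiased while the clipping tames the occasional extreme observation, which is the mechanism behind the sub-Gaussian concentration claimed in the abstract.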
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1284-1295 Template-Type: ReDIF-Article 1.0 Author-Name: Jonas Mueller Author-X-Name-First: Jonas Author-X-Name-Last: Mueller Author-Name: Tommi Jaakkola Author-X-Name-First: Tommi Author-X-Name-Last: Jaakkola Author-Name: David Gifford Author-X-Name-First: David Author-X-Name-Last: Gifford Title: Modeling Persistent Trends in Distributions Abstract: We present a nonparametric framework to model a short sequence of probability distributions that vary due to both underlying effects of sequential progression and confounding noise. To distinguish between these two types of variation and estimate the sequential-progression effects, our approach leverages an assumption that these effects follow a persistent trend. This work is motivated by the recent rise of single-cell RNA-sequencing experiments over a brief time course, which aim to identify genes relevant to the progression of a particular biological process across diverse cell populations. While classical statistical tools focus on scalar-response regression or order-agnostic differences between distributions, it is desirable in this setting to consider both the full distributions as well as the structure imposed by their ordering. We introduce a new regression model for ordinal covariates where responses are univariate distributions and the underlying relationship reflects consistent changes in the distributions over increasing levels of the covariate. This concept is formalized as a trend in distributions, which we define as an evolution that is linear under the Wasserstein metric. Implemented via a fast alternating projections algorithm, our method exhibits numerous strengths in simulations and analyses of single-cell gene expression data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1296-1310 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1341412 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1341412 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1296-1310 Template-Type: ReDIF-Article 1.0 Author-Name: Harry Crane Author-X-Name-First: Harry Author-X-Name-Last: Crane Author-Name: Walter Dempsey Author-X-Name-First: Walter Author-X-Name-Last: Dempsey Title: Edge Exchangeable Models for Interaction Networks Abstract: Many modern network datasets arise from processes of interactions in a population, such as phone calls, email exchanges, co-authorships, and professional collaborations. In such interaction networks, the edges comprise the fundamental statistical units, making a framework for edge-labeled networks more appropriate for statistical analysis. In this context, we initiate the study of edge exchangeable network models and explore their basic statistical properties. Several theoretical and practical features make edge exchangeable models better suited to many applications in network analysis than more common vertex-centric approaches. In particular, edge exchangeable models allow for sparse structure and power law degree distributions, both of which are widely observed empirical properties that cannot be handled naturally by more conventional approaches. Our discussion culminates in the Hollywood model, which we identify here as the canonical family of edge exchangeable distributions.
The Hollywood model is computationally tractable, admits a clear interpretation, exhibits good theoretical properties, and performs reasonably well in estimation and prediction, as we demonstrate on real network datasets. As a generalization of the Hollywood model, we further identify the vertex components model as a nonparametric subclass of models with a convenient stick-breaking construction. Journal: Journal of the American Statistical Association Pages: 1311-1326 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1341413 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1341413 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1311-1326 Template-Type: ReDIF-Article 1.0 Author-Name: Max Sommerfeld Author-X-Name-First: Max Author-X-Name-Last: Sommerfeld Author-Name: Stephan Sain Author-X-Name-First: Stephan Author-X-Name-Last: Sain Author-Name: Armin Schwartzman Author-X-Name-First: Armin Author-X-Name-Last: Schwartzman Title: Confidence Regions for Spatial Excursion Sets From Repeated Random Field Observations, With an Application to Climate Abstract: The goal of this article is to give confidence regions for the excursion set of a spatial function above a given threshold from repeated noisy observations on a fine grid of fixed locations. Given an asymptotically Gaussian estimator of the target function, a pair of data-dependent nested excursion sets are constructed that are sub- and super-sets of the true excursion set, respectively, with a desired confidence. Asymptotic coverage probabilities are determined via a multiplier bootstrap method, requiring neither Gaussianity of the original data nor stationarity or smoothness of the limiting Gaussian field. The method is used to determine regions in North America where the mean summer and winter temperatures are expected to increase by mid-21st century by more than 2 degrees Celsius. Journal: Journal of the American Statistical Association Pages: 1327-1340 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1341838 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1341838 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1327-1340 Template-Type: ReDIF-Article 1.0 Author-Name: Uri Keich Author-X-Name-First: Uri Author-X-Name-Last: Keich Author-Name: William Stafford Noble Author-X-Name-First: William Stafford Author-X-Name-Last: Noble Title: Controlling the FDR in Imperfect Matches to an Incomplete Database Abstract: We consider the problem of controlling the false discovery rate (FDR) among discoveries from searching an incomplete database. This problem differs from the classical multiple testing setting because there are two different types of false discoveries: those arising from objects that have no match in the database and those that are incorrectly matched. We show that commonly used FDR-controlling procedures are inadequate for this setup, a special case of which is tandem mass spectrum identification. We then derive a novel FDR-controlling approach that extensive simulations suggest is unbiased. We also compare its performance with problem-specific as well as general FDR-controlling procedures using both simulated and real mass spectrometry data.
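[Editorial note: for readers unfamiliar with the classical baseline that the Keich and Noble abstract above argues is inadequate for incomplete-database search, here is a minimal Benjamini–Hochberg implementation. It is included only as a reference point; it is not the procedure the article derives.]

```python
# A minimal Benjamini-Hochberg procedure: reject the k smallest p-values,
# where k is the largest i with p_(i) <= alpha * i / m.
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of rejected hypotheses at FDR level alpha."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, len(p) + 1) / len(p)
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(len(p), dtype=bool)
    reject[order[:k]] = True
    return reject

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20, 0.74]))
```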
Journal: Journal of the American Statistical Association Pages: 973-982 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1375931 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1375931 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:973-982 Template-Type: ReDIF-Article 1.0 Author-Name: Michele Santacatterina Author-X-Name-First: Michele Author-X-Name-Last: Santacatterina Author-Name: Matteo Bottai Author-X-Name-First: Matteo Author-X-Name-Last: Bottai Title: Optimal Probability Weights for Inference With Constrained Precision Abstract: Probability weights are used in many areas of research including complex survey designs, missing data analysis, and adjustment for confounding factors. They are useful analytic tools but can lead to statistical inefficiencies when they contain outlying values. This issue is frequently tackled by replacing large weights with smaller ones or by normalizing them through smoothing functions. While these approaches are practical, they are also prone to yield biased inferences. This article introduces a method for obtaining optimal weights, defined as those with smallest Euclidean distance from target weights among all sets of weights that satisfy a constraint on the variance of the resulting weighted estimator. The optimal weights yield minimum-bias estimators among all estimators with specified precision. The method is based on solving a constrained nonlinear optimization problem whose Lagrange multipliers and objective function can help assess the trade-off between bias and precision of the resulting weighted estimator. The finite-sample performance of the optimally weighted estimator is assessed in a simulation study, and its applicability is illustrated through an analysis of heterogeneity over age of the effect of the timing of treatment initiation on long-term treatment efficacy in patients infected with human immunodeficiency virus in Sweden. Journal: Journal of the American Statistical Association Pages: 983-991 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1375932 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1375932 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:983-991 Template-Type: ReDIF-Article 1.0 Author-Name: Sungduk Kim Author-X-Name-First: Sungduk Author-X-Name-Last: Kim Author-Name: Paul S. Albert Author-X-Name-First: Paul S. Author-X-Name-Last: Albert Title: Latent Variable Poisson Models for Assessing the Regularity of Circadian Patterns over Time Abstract: Many researchers in biology and medicine have focused on trying to understand biological rhythms and their potential impact on disease. A common biological rhythm is circadian, where the cycle repeats itself every 24 hours. However, a disturbance of the circadian pattern may be indicative of future disease. In this article, we develop new statistical methodology for assessing the degree of disturbance or irregularity in a circadian pattern for count sequences that are observed over time in a population of individuals. We develop a latent variable Poisson modeling approach with both circadian and stochastic short-term trend (autoregressive latent process) components that allow for individual variation in the degree of each component.
A parameterization is proposed for modeling the covariate dependence of the proportion of these two model components across individuals. In addition, we incorporate covariate dependence in the overall mean, the magnitude of the trend, and the phase-shift of the circadian pattern. Innovative Markov chain Monte Carlo sampling is used to carry out Bayesian posterior computation. Several variations of the proposed models are considered and compared using the deviance information criterion. We illustrate this methodology with physical activity count data measured in a longitudinal cohort of adolescents. Journal: Journal of the American Statistical Association Pages: 992-1002 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1379402 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1379402 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:992-1002 Template-Type: ReDIF-Article 1.0 Author-Name: Daniel Backenroth Author-X-Name-First: Daniel Author-X-Name-Last: Backenroth Author-Name: Jeff Goldsmith Author-X-Name-First: Jeff Author-X-Name-Last: Goldsmith Author-Name: Michelle D. Harran Author-X-Name-First: Michelle D. Author-X-Name-Last: Harran Author-Name: Juan C. Cortes Author-X-Name-First: Juan C. Author-X-Name-Last: Cortes Author-Name: John W. Krakauer Author-X-Name-First: John W. Author-X-Name-Last: Krakauer Author-Name: Tomoko Kitago Author-X-Name-First: Tomoko Author-X-Name-Last: Kitago Title: Modeling Motor Learning Using Heteroscedastic Functional Principal Components Analysis Abstract: We propose a novel method for estimating population-level and subject-specific effects of covariates on the variability of functional data. We extend the functional principal components analysis framework by modeling the variance of principal component scores as a function of covariates and subject-specific random effects. In a setting where principal components are largely invariant across subjects and covariate values, modeling the variance of these scores provides a flexible and interpretable way to explore factors that affect the variability of functional data. Our work is motivated by a novel dataset from an experiment assessing upper extremity motor control, and quantifies the reduction in movement variability associated with skill learning. The proposed methods can be applied broadly to understand movement variability, in settings that include motor learning, impairment due to injury or disease, and recovery. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1003-1015 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1379403 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1379403 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1003-1015 Template-Type: ReDIF-Article 1.0 Author-Name: Suyu Liu Author-X-Name-First: Suyu Author-X-Name-Last: Liu Author-Name: Beibei Guo Author-X-Name-First: Beibei Author-X-Name-Last: Guo Author-Name: Ying Yuan Author-X-Name-First: Ying Author-X-Name-Last: Yuan Title: A Bayesian Phase I/II Trial Design for Immunotherapy Abstract: Immunotherapy is an innovative treatment approach that stimulates a patient’s immune system to fight cancer. It demonstrates characteristics distinct from conventional chemotherapy and stands to revolutionize cancer treatment.
We propose a Bayesian phase I/II dose-finding design that incorporates the unique features of immunotherapy by simultaneously considering three outcomes: immune response, toxicity, and efficacy. The objective is to identify the biologically optimal dose, defined as the dose with the highest desirability in the risk–benefit tradeoff. An Emax model is utilized to describe the marginal distribution of the immune response. Conditional on the immune response, we jointly model toxicity and efficacy using a latent variable approach. Using the accumulating data, we adaptively randomize patients to experimental doses based on the continuously updated model estimates. A simulation study shows that our proposed design has good operating characteristics in terms of selecting the target dose and allocating patients to the target dose. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1016-1027 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1383260 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1383260 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1016-1027 Template-Type: ReDIF-Article 1.0 Author-Name: Daisy Philtron Author-X-Name-First: Daisy Author-X-Name-Last: Philtron Author-Name: Yafei Lyu Author-X-Name-First: Yafei Author-X-Name-Last: Lyu Author-Name: Qunhua Li Author-X-Name-First: Qunhua Author-X-Name-Last: Li Author-Name: Debashis Ghosh Author-X-Name-First: Debashis Author-X-Name-Last: Ghosh Title: Maximum Rank Reproducibility: A Nonparametric Approach to Assessing Reproducibility in Replicate Experiments Abstract: The identification of reproducible signals from the results of replicate high-throughput experiments is an important part of modern biological research. Often little is known about the dependence structure and the marginal distribution of the data, motivating the development of a nonparametric approach to assess reproducibility. The procedure, which we call the maximum rank reproducibility (MaRR) procedure, uses a maximum rank statistic to parse reproducible signals from noise without making assumptions about the distribution of reproducible signals. Because it uses the rank scale, this procedure can be easily applied to a variety of data types. One application is to assess the reproducibility of RNA-seq technology using data produced by the Sequencing Quality Control (SEQC) consortium, which coordinated a multi-laboratory effort to assess reproducibility across three RNA-seq platforms. Our results on simulations and SEQC data show that the MaRR procedure effectively controls false discovery rates, has desirable power properties, and compares well to existing methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1028-1039 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1397521 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1397521 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1028-1039 Template-Type: ReDIF-Article 1.0 Author-Name: Adam N. Glynn Author-X-Name-First: Adam N.
Author-X-Name-Last: Glynn Author-Name: Konstantin Kashin Author-X-Name-First: Konstantin Author-X-Name-Last: Kashin Title: Front-Door Versus Back-Door Adjustment With Unmeasured Confounding: Bias Formulas for Front-Door and Hybrid Adjustments With Application to a Job Training Program Abstract: We demonstrate that the front-door adjustment can be a useful alternative to standard covariate adjustments (i.e., back-door adjustments), even when the assumptions required for the front-door approach do not hold. We do this by providing asymptotic bias formulas for the front-door approach that can be compared directly to bias formulas for the back-door approach. In some cases, this allows the tightening of bounds on treatment effects. We also show that under one-sided noncompliance, the front-door approach does not rely on the use of control units. This finding has implications for the design of studies when treatment cannot be withheld from individuals (perhaps for ethical reasons). We illustrate these points with an application to the National Job Training Partnership Act Study. Journal: Journal of the American Statistical Association Pages: 1040-1049 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1398657 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1398657 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1040-1049 Template-Type: ReDIF-Article 1.0 Author-Name: Carlos M. Carvalho Author-X-Name-First: Carlos M. Author-X-Name-Last: Carvalho Author-Name: Hedibert F. Lopes Author-X-Name-First: Hedibert F. Author-X-Name-Last: Lopes Author-Name: Robert E. McCulloch Author-X-Name-First: Robert E. Author-X-Name-Last: McCulloch Title: On the Long-Run Volatility of Stocks Abstract: In this article, we investigate whether or not the volatility per period of stocks is lower over longer horizons. Taking the perspective of an investor, we evaluate the predictive variance of k-period returns under different model and prior specifications. We adopt the state-space framework of Pástor and Stambaugh to model the dynamics of expected returns and evaluate the effects of prior elicitation in the resulting volatility estimates. As part of these developments, we include an extension that incorporates time-varying volatilities and covariances in a constrained prior information set-up. Our conclusion for the U.S. market, under plausible prior specifications, is that stocks are less volatile in the long run. Model assessment exercises demonstrate that the models and priors supporting our main conclusions are in accordance with the data. To assess the generality of the results, we extend our analysis to a number of international equity indices. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1050-1069 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1407769 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1407769 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1050-1069 Template-Type: ReDIF-Article 1.0 Author-Name: Qingyuan Zhao Author-X-Name-First: Qingyuan Author-X-Name-Last: Zhao Author-Name: Dylan S. Small Author-X-Name-First: Dylan S. Author-X-Name-Last: Small Author-Name: Paul R. Rosenbaum Author-X-Name-First: Paul R.
Author-X-Name-Last: Rosenbaum Title: Cross-Screening in Observational Studies That Test Many Hypotheses Abstract: We discuss observational studies that test many causal hypotheses, either hypotheses about many outcomes or many treatments. To be credible, an observational study that tests many causal hypotheses must demonstrate that its conclusions are neither artifacts of multiple testing nor of small biases from nonrandom treatment assignment. In a sense that needs to be defined carefully, hidden within a sensitivity analysis for nonrandom assignment is an enormous correction for multiple testing: In the absence of bias, it is extremely improbable that multiple testing alone would create an association insensitive to moderate biases. We propose a new strategy called “cross-screening,” different from but motivated by recent work of Bogomolov and Heller on replicability. Cross-screening splits the data in half at random, uses the first half to plan a study carried out on the second half, then uses the second half to plan a study carried out on the first half, and reports the more favorable conclusions of the two studies, correcting with the Bonferroni inequality for having performed two studies. If the two studies happen to concur, then they achieve Bogomolov–Heller replicability; however, importantly, replicability is not required for strong control of the family-wise error rate, and either study alone suffices for firm conclusions. In randomized studies with just a few null hypotheses, cross-screening is not an attractive method when compared with conventional methods of multiplicity control. However, cross-screening has substantially higher power when hundreds or thousands of hypotheses are subjected to sensitivity analyses in an observational study of moderate size. We illustrate the technique by comparing 46 biomarkers in individuals who consume large quantities of fish versus little or no fish. The R package CrossScreening on CRAN implements the cross-screening method. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1070-1084 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1407770 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1407770 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1070-1084 Template-Type: ReDIF-Article 1.0 Author-Name: Faming Liang Author-X-Name-First: Faming Author-X-Name-Last: Liang Author-Name: Qizhai Li Author-X-Name-First: Qizhai Author-X-Name-Last: Li Author-Name: Lei Zhou Author-X-Name-First: Lei Author-X-Name-Last: Zhou Title: Bayesian Neural Networks for Selection of Drug Sensitive Genes Abstract: Recent advances in high-throughput biotechnologies have provided an unprecedented opportunity for biomarker discovery, which, from a statistical point of view, can be cast as a variable selection problem. This problem is challenging due to the high-dimensional and nonlinear nature of omics data and, in general, it suffers from three difficulties: (i) an unknown functional form of the nonlinear system, (ii) variable selection consistency, and (iii) high-demanding computation. To circumvent the first difficulty, we employ a feed-forward neural network to approximate the unknown nonlinear function, motivated by its universal approximation ability.
To circumvent the second difficulty, we conduct structure selection for the neural network, which induces variable selection, by choosing appropriate prior distributions that lead to the consistency of variable selection. To circumvent the third difficulty, we implement the population stochastic approximation Monte Carlo algorithm, a parallel adaptive Markov chain Monte Carlo algorithm, on the OpenMP platform that provides a linear speedup for the simulation with the number of cores of the computer. The numerical results indicate that the proposed method can work very well for identifying relevant variables in high-dimensional nonlinear systems. The proposed method is successfully applied to identification of the genes that are associated with anticancer drug sensitivities based on the data collected in the cancer cell line encyclopedia study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 955-972 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2017.1409122 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1409122 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:955-972 Template-Type: ReDIF-Article 1.0 Author-Name: Jaewoo Park Author-X-Name-First: Jaewoo Author-X-Name-Last: Park Author-Name: Murali Haran Author-X-Name-First: Murali Author-X-Name-Last: Haran Title: Bayesian Inference in the Presence of Intractable Normalizing Functions Abstract: Models with intractable normalizing functions arise frequently in statistics. Common examples of such models include exponential random graph models for social networks and Markov point processes for ecology and disease modeling. Inference for these models is complicated because the normalizing functions of their probability distributions include the parameters of interest. In Bayesian analysis, they result in so-called doubly intractable posterior distributions which pose significant computational challenges. Several Monte Carlo methods have emerged in recent years to address Bayesian inference for such models. We provide a framework for understanding the algorithms, and elucidate connections among them. Through multiple simulated and real data examples, we compare and contrast the computational and statistical efficiency of these algorithms and discuss their theoretical bases. Our study provides practical recommendations for practitioners along with directions for future research for Markov chain Monte Carlo (MCMC) methodologists. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1372-1390 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2018.1448824 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1448824 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1372-1390 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Book Reviews Journal: Journal of the American Statistical Association Pages: 1391-1394 Issue: 523 Volume: 113 Year: 2018 Month: 7 X-DOI: 10.1080/01621459.2018.1513232 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1513232 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
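[Editorial note: one algorithm in the class surveyed by Park and Haran above is the exchange algorithm of Murray, Ghahramani, and MacKay, in which the intractable normalizing constants cancel from the acceptance ratio. The toy below uses a von Mises model whose constant we pretend is unavailable; the flat prior, proposal scale, and fixed kappa are illustrative assumptions, not recommendations from the article.]

```python
# A minimal exchange-algorithm sketch for a doubly intractable posterior:
# infer the location mu of a von Mises model from its unnormalized density
# exp(kappa * cos(x - mu)), using exact auxiliary draws so that the unknown
# normalizing constant cancels in the acceptance ratio.
import numpy as np

rng = np.random.default_rng(4)
y = rng.vonmises(mu=0.8, kappa=2.0, size=200)       # observed data

def log_unnorm(data, mu, kappa=2.0):
    """Unnormalized von Mises log-density, summed over the data."""
    return kappa * np.sum(np.cos(data - mu))

mu, chain = 0.0, []
for _ in range(5_000):
    mu_prop = mu + 0.3 * rng.normal()
    aux = rng.vonmises(mu_prop, 2.0, size=len(y))   # exact auxiliary sample
    log_a = (log_unnorm(y, mu_prop) - log_unnorm(y, mu)
             + log_unnorm(aux, mu) - log_unnorm(aux, mu_prop))
    if np.log(rng.uniform()) < log_a:
        mu = mu_prop
    chain.append(mu)
print("posterior mean of mu:", np.mean(chain[1_000:]))
```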
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1391-1394 Template-Type: ReDIF-Article 1.0 Author-Name: Tianxi Cai Author-X-Name-First: Tianxi Author-X-Name-Last: Cai Author-Name: T. Tony Cai Author-X-Name-First: T. Tony Author-X-Name-Last: Cai Author-Name: Anru Zhang Author-X-Name-First: Anru Author-X-Name-Last: Zhang Title: Structured Matrix Completion with Applications to Genomic Data Integration Abstract: Matrix completion has attracted significant recent attention in many fields including statistics, applied mathematics, and electrical engineering. Current literature on matrix completion focuses primarily on independent sampling models under which the individual observed entries are sampled independently. Motivated by applications in genomic data integration, we propose a new framework of structured matrix completion (SMC) to treat structured missingness by design. Specifically, our proposed method aims at efficient matrix recovery when a subset of the rows and columns of an approximately low-rank matrix are observed. We provide theoretical justification for the proposed SMC method and derive a lower bound for the estimation errors, which together establish the optimal rate of recovery over certain classes of approximately low-rank matrices. Simulation studies show that the method performs well in finite samples under a variety of configurations. The method is applied to integrate several ovarian cancer genomic studies with different extents of genomic measurements, which enables us to construct more accurate prediction rules for ovarian cancer survival. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 621-633 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1021005 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1021005 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:621-633 Template-Type: ReDIF-Article 1.0 Author-Name: Chris J. Oates Author-X-Name-First: Chris J. Author-X-Name-Last: Oates Author-Name: Theodore Papamarkou Author-X-Name-First: Theodore Author-X-Name-Last: Papamarkou Author-Name: Mark Girolami Author-X-Name-First: Mark Author-X-Name-Last: Girolami Title: The Controlled Thermodynamic Integral for Bayesian Model Evidence Evaluation Abstract: Approximation of the model evidence is well known to be challenging. One promising approach is based on thermodynamic integration, but a key concern is that the thermodynamic integral can suffer from high variability in many applications. This article considers the reduction of variance that can be achieved by exploiting control variates in this setting. Our methodology applies whenever the gradient of both the log-likelihood and the log-prior with respect to the parameters can be efficiently evaluated. Results obtained on regression models and popular benchmark datasets demonstrate a significant and sometimes dramatic reduction in estimator variance and provide insight into the wider applicability of control variates to evidence estimation. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 634-645 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1021006 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1021006 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:634-645 Template-Type: ReDIF-Article 1.0 Author-Name: Shuyuan He Author-X-Name-First: Shuyuan Author-X-Name-Last: He Author-Name: Wei Liang Author-X-Name-First: Wei Author-X-Name-Last: Liang Author-Name: Junshan Shen Author-X-Name-First: Junshan Author-X-Name-Last: Shen Author-Name: Grace Yang Author-X-Name-First: Grace Author-X-Name-Last: Yang Title: Empirical Likelihood for Right Censored Lifetime Data Abstract: When the empirical likelihood (EL) of a parameter θ is constructed with right censored data, the literature shows that −2 log(empirical likelihood ratio) typically has an asymptotic scaled chi-squared distribution, where the scale parameter is a function of some unknown asymptotic variances. Therefore, the EL construction of confidence intervals for θ requires an additional estimation of the scale parameter. Additional estimation would reduce the coverage accuracy for θ. By using a special influence function as an estimating function, we prove that, under very general conditions, −2 log(empirical likelihood ratio) has an asymptotic standard chi-squared distribution with one degree of freedom. This eliminates the need for estimating the scale parameter and eases some of the often demanding computations of the EL method. Our estimating function yields a smaller asymptotic variance than those of Wang and Jing (2001) and Qin and Zhao (2007). Thus, it is not surprising that confidence intervals using the special influence functions give better coverage accuracy, as demonstrated by simulations. Journal: Journal of the American Statistical Association Pages: 646-655 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1024058 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1024058 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:646-655 Template-Type: ReDIF-Article 1.0 Author-Name: Yun Yang Author-X-Name-First: Yun Author-X-Name-Last: Yang Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: Bayesian Conditional Tensor Factorizations for High-Dimensional Classification Abstract: In many application areas, data are collected on a categorical response and high-dimensional categorical predictors, with the goals being to build a parsimonious model for classification while making inferences on the important predictors. In settings such as genomics, there can be complex interactions among the predictors. By using a carefully structured Tucker factorization, we define a model that can characterize any conditional probability, while facilitating variable selection and modeling of higher-order interactions. Following a Bayesian approach, we propose a Markov chain Monte Carlo algorithm for posterior computation accommodating uncertainty in the predictors to be included. Under near low-rank assumptions, the posterior distribution for the conditional probability is shown to achieve close to the parametric rate of contraction even in ultra high-dimensional settings. The methods are illustrated using simulation examples and biomedical applications. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 656-669 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1029129 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1029129 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:656-669 Template-Type: ReDIF-Article 1.0 Author-Name: Michael W. Robbins Author-X-Name-First: Michael W. Author-X-Name-Last: Robbins Author-Name: Colin M. Gallagher Author-X-Name-First: Colin M. Author-X-Name-Last: Gallagher Author-Name: Robert B. Lund Author-X-Name-First: Robert B. Author-X-Name-Last: Lund Title: A General Regression Changepoint Test for Time Series Data Abstract: This article develops a test for a single changepoint in a general setting that allows for correlated time series regression errors, a seasonal cycle, time-varying regression factors, and covariate information. Within this setting, a changepoint statistic is constructed from likelihood ratio principles, and its asymptotic distribution is derived. The asymptotic distribution of the changepoint statistic is shown to be invariant to the seasonal cycle and the covariates, should the latter obey some simple limit laws; however, the limit distribution depends on any time-varying factors. A new test based on ARMA residuals is developed and is shown to have favorable properties in finite samples. Driving our work is a changepoint analysis of the Mauna Loa record of monthly carbon dioxide concentrations. This series has a pronounced seasonal cycle, a nonlinear trend, heavily correlated regression errors, and covariate information in the form of climate oscillations. In the end, we find a prominent changepoint in the early 1990s, often attributed to the eruption of Mount Pinatubo, which cannot be explained by covariates. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 670-683 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1029130 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1029130 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:670-683 Template-Type: ReDIF-Article 1.0 Author-Name: Ying Yan Author-X-Name-First: Ying Author-X-Name-Last: Yan Author-Name: Grace Y. Yi Author-X-Name-First: Grace Y. Author-X-Name-Last: Yi Title: A Class of Functional Methods for Error-Contaminated Survival Data Under Additive Hazards Models with Replicate Measurements Abstract: Covariate measurement error has attracted extensive interest in survival analysis. Since the seminal work of Prentice, a large number of inference methods have been developed to handle error-prone data that are modeled with proportional hazards models. In contrast to proportional hazards models, additive hazards models offer a flexible tool to delineate survival processes. However, there is little research on measurement error effects under additive hazards models. In this article, we systematically investigate this important problem. New insights into measurement error effects are revealed, as opposed to well-documented results for proportional hazards models. In particular, we explore the asymptotic bias of ignoring measurement error in the analysis. To correct for the induced bias, we develop a class of functional correction methods for measurement error effects.
The validity of the proposed methods is carefully examined, and we investigate issues of model checking and model misspecification. Theoretical results are established, and are complemented with numerical assessments. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 684-695 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1034317 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1034317 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:684-695 Template-Type: ReDIF-Article 1.0 Author-Name: Irina Gaynanova Author-X-Name-First: Irina Author-X-Name-Last: Gaynanova Author-Name: James G. Booth Author-X-Name-First: James G. Author-X-Name-Last: Booth Author-Name: Martin T. Wells Author-X-Name-First: Martin T. Author-X-Name-Last: Wells Title: Simultaneous Sparse Estimation of Canonical Vectors in the p ≫ N Setting Abstract: This article considers the problem of sparse estimation of canonical vectors in linear discriminant analysis when p ≫ N. Several methods have been proposed in the literature that estimate one canonical vector in the two-group case. However, G − 1 canonical vectors can be considered if the number of groups is G. In the multi-group context, it is common to estimate canonical vectors in a sequential fashion. Moreover, separate prior estimation of the covariance structure is often required. We propose a novel methodology for direct estimation of canonical vectors. In contrast to existing techniques, the proposed method estimates all canonical vectors at once, performs variable selection across all the vectors, and comes with theoretical guarantees on variable selection and classification consistency. First, we highlight the fact that in the N > p setting the canonical vectors can be expressed in a closed form up to an orthogonal transformation. Second, we propose an extension of this form to the p ≫ N setting and achieve feature selection by using a group penalty. The resulting optimization problem is convex and can be solved using a block-coordinate descent algorithm. The practical performance of the method is evaluated through simulation studies as well as real data applications. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 696-706 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1034318 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1034318 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:696-706 Template-Type: ReDIF-Article 1.0 Author-Name: Guan Yu Author-X-Name-First: Guan Author-X-Name-Last: Yu Author-Name: Yufeng Liu Author-X-Name-First: Yufeng Author-X-Name-Last: Liu Title: Sparse Regression Incorporating Graphical Structure Among Predictors Abstract: With the abundance of high-dimensional data in various disciplines, sparse regularization techniques have become very popular. In this article, we make use of the structure information among predictors to improve sparse regression models. Typically, such structure information can be modeled by the connectivity of an undirected graph using all predictors as nodes of the graph. Most existing methods use this undirected graph edge-by-edge to encourage the regression coefficients of corresponding connected predictors to be similar.
However, such methods do not directly use the neighborhood information of the graph. Furthermore, if there are more edges in the predictor graph, the corresponding regularization term will be more complicated. In this article, we incorporate the graph information node-by-node, instead of edge-by-edge as used in most existing methods. Our proposed method is very general, and it includes the adaptive lasso, group lasso, and ridge regression as special cases. Both theoretical and numerical studies demonstrate the effectiveness of the proposed method for simultaneous estimation, prediction, and model selection. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 707-720 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1034319 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1034319 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:707-720 Template-Type: ReDIF-Article 1.0 Author-Name: Long Feng Author-X-Name-First: Long Author-X-Name-Last: Feng Author-Name: Changliang Zou Author-X-Name-First: Changliang Author-X-Name-Last: Zou Author-Name: Zhaojun Wang Author-X-Name-First: Zhaojun Author-X-Name-Last: Wang Title: Multivariate-Sign-Based High-Dimensional Tests for the Two-Sample Location Problem Abstract: This article concerns tests for the two-sample location problem when the data dimension is larger than the sample size. Existing multivariate-sign-based procedures are not robust against high dimensionality, producing tests with Type I error rates far from nominal levels. This is mainly due to the biases from estimating location parameters. We propose a novel test to overcome this issue by using the “leave-one-out” idea. The proposed test statistic is scalar-invariant and thus is particularly useful when different components have different scales in high-dimensional data. Asymptotic properties of the test statistic are studied. Simulation studies show that, compared with other existing approaches, the proposed method behaves well in terms of size and power. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 721-735 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1035380 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1035380 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:721-735 Template-Type: ReDIF-Article 1.0 Author-Name: Jin Tang Author-X-Name-First: Jin Author-X-Name-Last: Tang Author-Name: Yehua Li Author-X-Name-First: Yehua Author-X-Name-Last: Li Author-Name: Yongtao Guan Author-X-Name-First: Yongtao Author-X-Name-Last: Guan Title: Generalized Quasi-Likelihood Ratio Tests for Semiparametric Analysis of Covariance Models in Longitudinal Data Abstract: We model generalized longitudinal data from multiple treatment groups by a class of semiparametric analysis of covariance models, which take into account the parametric effects of time-dependent covariates and the nonparametric time effects. In these models, the treatment effects are represented by nonparametric functions of time, and we propose a generalized quasi-likelihood ratio test procedure to test whether these functions are identical. Our estimation procedure is based on profile estimating equations combined with local linear smoothers.
We find that the much-celebrated Wilks phenomenon, which is well established for independent data, still holds for longitudinal data if a working independence correlation structure is assumed in the test statistic. However, this property does not hold in general, especially when the working variance function is misspecified. Our empirical study also shows that incorporating correlation into the test statistic does not necessarily improve the power of the test. The proposed methods are illustrated with simulation studies and a real application from opioid dependence treatments. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 736-747 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1036995 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1036995 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:736-747 Template-Type: ReDIF-Article 1.0 Author-Name: Pete Bunch Author-X-Name-First: Pete Author-X-Name-Last: Bunch Author-Name: Simon Godsill Author-X-Name-First: Simon Author-X-Name-Last: Godsill Title: Approximations of the Optimal Importance Density Using Gaussian Particle Flow Importance Sampling Abstract: Recently developed particle flow algorithms provide an alternative to importance sampling for drawing particles from a posterior distribution, and a number of particle filters based on this principle have been proposed. Samples are drawn from the prior and then moved according to some dynamics over an interval of pseudo-time such that their final values are distributed according to the desired posterior. In practice, implementing a particle flow sampler requires multiple layers of approximation, with the result that the final samples do not in general have the correct posterior distribution. In this article, we consider using an approximate Gaussian flow for sampling with a class of nonlinear Gaussian models. We use the particle flow within an importance sampler, correcting for the discrepancy between the target and actual densities with importance weights. We present a suitable numerical integration procedure for use with this flow and an accompanying step-size control algorithm. In a filtering context, we use the particle flow to sample from the optimal importance density, rather than the filtering density itself, avoiding the need to make analytical or numerical approximations of the predictive density. Simulations using particle flow importance sampling within a particle filter demonstrate significant improvement over standard approximations of the optimal importance density, and the algorithm falls within the standard sequential Monte Carlo framework. Journal: Journal of the American Statistical Association Pages: 748-762 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1038387 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1038387 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:748-762 Template-Type: ReDIF-Article 1.0 Author-Name: Yiyuan She Author-X-Name-First: Yiyuan Author-X-Name-Last: She Author-Name: Shijie Li Author-X-Name-First: Shijie Author-X-Name-Last: Li Author-Name: Dapeng Wu Author-X-Name-First: Dapeng Author-X-Name-Last: Wu Title: Robust Orthogonal Complement Principal Component Analysis Abstract: Recently, the robustification of principal component analysis (PCA) has attracted much attention from statisticians, engineers, and computer scientists. In this work, we study the type of outliers that are not necessarily apparent in the original observation space but can seriously affect the principal subspace estimation. Based on a mathematical formulation of such transformed outliers, a novel robust orthogonal complement principal component analysis (ROC-PCA) is proposed. The framework combines the popular sparsity-enforcing and low-rank regularization techniques to deal with row-wise outliers as well as element-wise outliers. A nonasymptotic oracle inequality guarantees the accuracy and high breakdown performance of ROC-PCA in finite samples. To tackle the computational challenges, an efficient algorithm is developed on the basis of Stiefel manifold optimization and iterative thresholding. Furthermore, a batch variant is proposed to significantly reduce the cost in ultra high dimensions. The article also points out a pitfall of a common practice of singular value decomposition (SVD) reduction in robust PCA. Experiments show the effectiveness and efficiency of ROC-PCA on both synthetic and real data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 763-771 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1042107 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1042107 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:763-771 Template-Type: ReDIF-Article 1.0 Author-Name: Lin Zhang Author-X-Name-First: Lin Author-X-Name-Last: Zhang Author-Name: Veerabhadran Baladandayuthapani Author-X-Name-First: Veerabhadran Author-X-Name-Last: Baladandayuthapani Author-Name: Hongxiao Zhu Author-X-Name-First: Hongxiao Author-X-Name-Last: Zhu Author-Name: Keith A. Baggerly Author-X-Name-First: Keith A. Author-X-Name-Last: Baggerly Author-Name: Tadeusz Majewski Author-X-Name-First: Tadeusz Author-X-Name-Last: Majewski Author-Name: Bogdan A. Czerniak Author-X-Name-First: Bogdan A. Author-X-Name-Last: Czerniak Author-Name: Jeffrey S. Morris Author-X-Name-First: Jeffrey S. Author-X-Name-Last: Morris Title: Functional CAR Models for Large Spatially Correlated Functional Datasets Abstract: We develop a functional conditional autoregressive (CAR) model for spatially correlated data for which functions are collected on areal units of a lattice. Our model performs functional response regression while accounting for spatial correlations with potentially nonseparable and nonstationary covariance structure, in both the space and functional domains. We show theoretically that our construction leads to a CAR model at each functional location, with spatial covariance parameters varying and borrowing strength across the functional domain.
Using basis transformation strategies, the nonseparable spatial-functional model is computationally scalable to enormous functional datasets, is generalizable to different basis functions, and can be used on functions defined on higher-dimensional domains such as images. Through simulation studies, we demonstrate that accounting for the spatial correlation in our modeling leads to improved functional regression performance. Applied to a high-throughput spatially correlated copy number dataset, the model identifies genetic markers not identified by comparable methods that ignore spatial correlations. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 772-786 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1042581 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1042581 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:772-786 Template-Type: ReDIF-Article 1.0 Author-Name: Chiung-Yu Huang Author-X-Name-First: Chiung-Yu Author-X-Name-Last: Huang Author-Name: Jing Qin Author-X-Name-First: Jing Author-X-Name-Last: Qin Author-Name: Huei-Ting Tsai Author-X-Name-First: Huei-Ting Author-X-Name-Last: Tsai Title: Efficient Estimation of the Cox Model with Auxiliary Subgroup Survival Information Abstract: With the rapidly increasing availability of data in the public domain, combining information from different sources to draw inferences about associations or differences of interest has become an emerging challenge to researchers. This article presents a novel approach to improve efficiency in estimating the survival time distribution by synthesizing information from the individual-level data with t-year survival probabilities from external sources such as disease registries. While disease registries provide accurate and reliable overall survival statistics for the disease population, critical pieces of information that influence both choice of treatment and clinical outcomes usually are not available in the registry database. To combine with the published information, we propose to summarize the external survival information via a system of nonlinear population moments and estimate the survival time model using empirical likelihood methods. The proposed approach is more flexible than conventional meta-analysis in the sense that it can automatically combine survival information for different subgroups and the information may be derived from different studies. Moreover, an extended estimator that allows for a different baseline risk in the aggregate data is also studied. Empirical likelihood ratio tests are proposed to examine whether the auxiliary survival information is consistent with the individual-level data. Simulation studies show that the proposed estimators yield a substantial gain in efficiency over the conventional partial likelihood approach. Two data analyses are conducted to illustrate the methods and theory. Journal: Journal of the American Statistical Association Pages: 787-799 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1044090 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1044090 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:787-799 Template-Type: ReDIF-Article 1.0 Author-Name: Abhirup Datta Author-X-Name-First: Abhirup Author-X-Name-Last: Datta Author-Name: Sudipto Banerjee Author-X-Name-First: Sudipto Author-X-Name-Last: Banerjee Author-Name: Andrew O. Finley Author-X-Name-First: Andrew O. Author-X-Name-Last: Finley Author-Name: Alan E. Gelfand Author-X-Name-First: Alan E. Author-X-Name-Last: Gelfand Title: Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geostatistical Datasets Abstract: Spatial process models for analyzing geostatistical data entail computations that become prohibitive as the number of spatial locations becomes large. This article develops a class of highly scalable nearest-neighbor Gaussian process (NNGP) models to provide fully model-based inference for large geostatistical datasets. We establish that the NNGP is a well-defined spatial process providing legitimate finite-dimensional Gaussian densities with sparse precision matrices. We embed the NNGP as a sparsity-inducing prior within a rich hierarchical modeling framework and outline how computationally efficient Markov chain Monte Carlo (MCMC) algorithms can be executed without storing or decomposing large matrices. The number of floating point operations (flops) per iteration of this algorithm is linear in the number of spatial locations, thereby rendering substantial scalability. We illustrate the computational and inferential benefits of the NNGP over competing methods using simulation studies and also analyze forest biomass from a massive U.S. Forest Inventory dataset at a scale that precludes alternative dimension-reducing methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 800-812 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1044091 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1044091 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:800-812 Template-Type: ReDIF-Article 1.0 Author-Name: Zhou Yu Author-X-Name-First: Zhou Author-X-Name-Last: Yu Author-Name: Yuexiao Dong Author-X-Name-First: Yuexiao Author-X-Name-Last: Dong Author-Name: Li-Xing Zhu Author-X-Name-First: Li-Xing Author-X-Name-Last: Zhu Title: Trace Pursuit: A General Framework for Model-Free Variable Selection Abstract: We propose trace pursuit for model-free variable selection under the sufficient dimension-reduction paradigm. Two distinct algorithms are proposed: stepwise trace pursuit and forward trace pursuit. Stepwise trace pursuit achieves selection consistency with fixed p. Forward trace pursuit can serve as an initial screening step to speed up the computation in the case of ultrahigh dimensionality. The screening consistency property of forward trace pursuit based on sliced inverse regression is established. Finite sample performances of trace pursuit and other model-free variable selection methods are compared through numerical studies. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 813-821 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1050494 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1050494 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:813-821 Template-Type: ReDIF-Article 1.0 Author-Name: F.
Jay Breidt Author-X-Name-First: F. Jay Author-X-Name-Last: Breidt Author-Name: Jean D. Opsomer Author-X-Name-First: Jean D. Author-X-Name-Last: Opsomer Author-Name: Ismael Sanchez-Borrego Author-X-Name-First: Ismael Author-X-Name-Last: Sanchez-Borrego Title: Nonparametric Variance Estimation Under Fine Stratification: An Alternative to Collapsed Strata Abstract: Fine stratification is commonly used to control the distribution of a sample from a finite population and to improve the precision of resulting estimators. One-per-stratum designs represent the finest possible stratification and occur in practice, but designs with very low numbers of elements per stratum (say, two or three) are also common. The classical variance estimator in this context is the collapsed stratum estimator, which relies on creating larger “pseudo-strata” and computing the sum of the squared differences between estimated stratum totals across the pseudo-strata. We propose here a nonparametric alternative that replaces the pseudo-strata by kernel-weighted stratum neighborhoods and uses deviations from a fitted mean function to estimate the variance. We establish the asymptotic behavior of the kernel-based estimator and show its superior practical performance relative to the collapsed stratum variance estimator in a simulation study. An application to data from the U.S. Consumer Expenditure Survey illustrates the potential of the method in practice. Journal: Journal of the American Statistical Association Pages: 822-833 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1058264 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1058264 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:822-833 Template-Type: ReDIF-Article 1.0 Author-Name: Jacob Bien Author-X-Name-First: Jacob Author-X-Name-Last: Bien Author-Name: Florentina Bunea Author-X-Name-First: Florentina Author-X-Name-Last: Bunea Author-Name: Luo Xiao Author-X-Name-First: Luo Author-X-Name-Last: Xiao Title: Convex Banding of the Covariance Matrix Abstract: We introduce a new sparse estimator of the covariance matrix for high-dimensional models in which the variables have a known ordering. Our estimator, which is the solution to a convex optimization problem, is equivalently expressed as an estimator that tapers the sample covariance matrix by a Toeplitz, sparsely banded, data-adaptive matrix. As a result of this adaptivity, the convex banding estimator enjoys theoretical optimality properties not attained by previous banding or tapered estimators. In particular, our convex banding estimator is minimax rate adaptive in Frobenius and operator norms, up to log factors, over commonly studied classes of covariance matrices, and over more general classes. Furthermore, it correctly recovers the bandwidth when the true covariance is exactly banded. Our convex formulation admits a simple and efficient algorithm. Empirical studies demonstrate its practical effectiveness and illustrate that our exactly banded estimator works well even when the true covariance matrix is only close to a banded matrix, confirming our theoretical results. Our method compares favorably with all existing methods, in terms of accuracy and speed. We illustrate the practical merits of the convex banding estimator by showing that it can be used to improve the performance of discriminant analysis for classifying sound recordings. Supplementary materials for this article are available online. 
Journal: Journal of the American Statistical Association Pages: 834-845 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1058265 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1058265 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:834-845 Template-Type: ReDIF-Article 1.0 Author-Name: Aaron Fisher Author-X-Name-First: Aaron Author-X-Name-Last: Fisher Author-Name: Brian Caffo Author-X-Name-First: Brian Author-X-Name-Last: Caffo Author-Name: Brian Schwartz Author-X-Name-First: Brian Author-X-Name-Last: Schwartz Author-Name: Vadim Zipunnikov Author-X-Name-First: Vadim Author-X-Name-Last: Zipunnikov Title: Fast, Exact Bootstrap Principal Component Analysis for p > 1 Million Abstract: Many have suggested a bootstrap procedure for estimating the sampling variability of principal component analysis (PCA) results. However, when the number of measurements per subject (p) is much larger than the number of subjects (n), calculating and storing the leading principal components (PCs) from each bootstrap sample can be computationally infeasible. To address this, we outline methods for fast, exact calculation of bootstrap PCs, eigenvalues, and scores. Our methods leverage the fact that all bootstrap samples occupy the same n-dimensional subspace as the original sample. As a result, all bootstrap PCs are limited to the same n-dimensional subspace and can be efficiently represented by their low-dimensional coordinates in that subspace. Several uncertainty metrics can be computed solely based on the bootstrap distribution of these low-dimensional coordinates, without calculating or storing the p-dimensional bootstrap components. Fast bootstrap PCA is applied to a dataset of sleep electroencephalogram recordings (p = 900, n = 392), and to a dataset of brain magnetic resonance images (MRIs) (p ≈ 3 million, n = 352). For the MRI dataset, our method allows for standard errors for the first three PCs based on 1000 bootstrap samples to be calculated on a standard laptop in 47 min, as opposed to approximately 4 days with standard methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 846-860 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1062383 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1062383 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:846-860 Template-Type: ReDIF-Article 1.0 Author-Name: Baojiang Chen Author-X-Name-First: Baojiang Author-X-Name-Last: Chen Author-Name: Pengfei Li Author-X-Name-First: Pengfei Author-X-Name-Last: Li Author-Name: Jing Qin Author-X-Name-First: Jing Author-X-Name-Last: Qin Author-Name: Tao Yu Author-X-Name-First: Tao Author-X-Name-Last: Yu Title: Using a Monotonic Density Ratio Model to Find the Asymptotically Optimal Combination of Multiple Diagnostic Tests Abstract: With the advent of new technology, new biomarker studies have become essential in cancer research. To achieve optimal sensitivity and specificity, one needs to combine different diagnostic tests. The celebrated Neyman–Pearson lemma enables us to use the density ratio to optimally combine different diagnostic tests.
In this article, we propose a semiparametric model by directly modeling the density ratio between the diseased and nondiseased populations as an unspecified monotonic nondecreasing function of a linear or nonlinear combination of multiple diagnostic tests. This method is appealing in that it is not necessary to assume separate models for the diseased and nondiseased populations. Further, the proposed method provides an asymptotically optimal way to combine multiple test results. We use the pool adjacent violators algorithm to find the semiparametric maximum likelihood estimate of the receiver operating characteristic (ROC) curve. Using modern empirical process theory, we show cube-root-n consistency for the estimation of the ROC curve and the underlying Euclidean parameter. Extensive simulations show that the proposed method outperforms its competitors. We apply the method in two real-data applications. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 861-874 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1066681 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1066681 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:861-874 Template-Type: ReDIF-Article 1.0 Author-Name: Stanislav Minsker Author-X-Name-First: Stanislav Author-X-Name-Last: Minsker Author-Name: Ying-Qi Zhao Author-X-Name-First: Ying-Qi Author-X-Name-Last: Zhao Author-Name: Guang Cheng Author-X-Name-First: Guang Author-X-Name-Last: Cheng Title: Active Clinical Trials for Personalized Medicine Abstract: Individualized treatment rules (ITRs) tailor treatments according to individual patient characteristics. They can significantly improve patient care and are thus becoming increasingly popular. The data collected during randomized clinical trials are often used to estimate the optimal ITRs. However, these trials are generally expensive to run, and, moreover, they are not designed to efficiently estimate ITRs. In this article, we propose a cost-effective estimation method from an active learning perspective. In particular, our method recruits only the “most informative” patients (in terms of learning the optimal ITRs) from an ongoing clinical trial. Simulation studies and real-data examples show that our active clinical trial method significantly improves on competing methods. We derive risk bounds and show that they support these observed empirical advantages. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 875-887 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1066682 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1066682 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:875-887 Template-Type: ReDIF-Article 1.0 Author-Name: Emilio Porcu Author-X-Name-First: Emilio Author-X-Name-Last: Porcu Author-Name: Moreno Bevilacqua Author-X-Name-First: Moreno Author-X-Name-Last: Bevilacqua Author-Name: Marc G. Genton Author-X-Name-First: Marc G.
Author-X-Name-Last: Genton Title: Spatio-Temporal Covariance and Cross-Covariance Functions of the Great Circle Distance on a Sphere Abstract: In this article, we propose stationary covariance functions for processes that evolve temporally over a sphere, as well as cross-covariance functions for multivariate random fields defined over a sphere. For such processes, the great circle distance is the natural metric that should be used to describe spatial dependence. Given the mathematical difficulties in the construction of covariance functions for processes defined over spheres cross time, approximations of the state of nature have been proposed in the literature by using the Euclidean (based on map projections) and the chordal distances. We present several methods of construction based on the great circle distance and provide closed-form expressions for both spatio-temporal and multivariate cases. A simulation study assesses the discrepancy between the great circle distance, chordal distance, and Euclidean distance based on a map projection both in terms of estimation and prediction in a space-time and a bivariate spatial setting, where the space in this case is the Earth. We revisit the analysis of Total Ozone Mapping Spectrometer (TOMS) data and investigate differences in terms of estimation and prediction between the aforementioned distance-based approaches. Both simulations and real data highlight appreciable differences in the estimation of the spatial scale parameter. As far as prediction is concerned, the differences can be appreciated only when the interpoint distances are large, as demonstrated by an illustrative example. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 888-898 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1072541 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1072541 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:888-898 Template-Type: ReDIF-Article 1.0 Author-Name: Ryan J. Tibshirani Author-X-Name-First: Ryan J. Author-X-Name-Last: Tibshirani Author-Name: Jonathan Taylor Author-X-Name-First: Jonathan Author-X-Name-Last: Taylor Author-Name: Richard Lockhart Author-X-Name-First: Richard Author-X-Name-Last: Lockhart Author-Name: Robert Tibshirani Author-X-Name-First: Robert Author-X-Name-Last: Tibshirani Title: Exact Post-Selection Inference for Sequential Regression Procedures Abstract: We propose new inference tools for forward stepwise regression, least angle regression, and the lasso. Assuming a Gaussian model for the observation vector y, we first describe a general scheme to perform valid inference after any selection event that can be characterized as y falling into a polyhedral set. This framework allows us to derive conditional (post-selection) hypothesis tests at any step of forward stepwise or least angle regression, or any step along the lasso regularization path, because, as it turns out, selection events for these procedures can be expressed as polyhedral constraints on y. The p-values associated with these tests are exactly uniform under the null distribution, in finite samples, yielding exact Type I error control. The tests can also be inverted to produce confidence intervals for appropriate underlying regression parameters.
The R package selectiveInference, freely available on the CRAN repository, implements the new inference tools described in this article. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 600-620 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1108848 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1108848 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:600-620 Template-Type: ReDIF-Article 1.0 Author-Name: Colin B. Fogarty Author-X-Name-First: Colin B. Author-X-Name-Last: Fogarty Author-Name: Mark E. Mikkelsen Author-X-Name-First: Mark E. Author-X-Name-Last: Mikkelsen Author-Name: David F. Gaieski Author-X-Name-First: David F. Author-X-Name-Last: Gaieski Author-Name: Dylan S. Small Author-X-Name-First: Dylan S. Author-X-Name-Last: Small Title: Discrete Optimization for Interpretable Study Populations and Randomization Inference in an Observational Study of Severe Sepsis Mortality Abstract: Motivated by an observational study of the effect of hospital ward versus intensive care unit admission on severe sepsis mortality, we develop methods to address two common problems in observational studies: (1) when there is a lack of covariate overlap between the treated and control groups, how to define an interpretable study population wherein inference can be conducted without extrapolating with respect to important variables; and (2) how to use randomization inference to form confidence intervals for the average treatment effect with binary outcomes. Our solution to problem (1) incorporates existing suggestions in the literature while yielding a study population that is easily understood in terms of the covariates themselves, and can be solved using an efficient branch-and-bound algorithm. We address problem (2) by solving a linear integer program to compute the worst-case variance of the average treatment effect among values for unobserved potential outcomes that are compatible with the null hypothesis. Among less severely ill patients and among patients with cryptic septic shock, our analysis finds no evidence for a difference between the 60-day mortality rates if all individuals were admitted to the ICU and if all individuals were admitted to the hospital ward. We implement our methodology in R, providing scripts in the supplementary material. Journal: Journal of the American Statistical Association Pages: 447-458 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1112802 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1112802 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:447-458 Template-Type: ReDIF-Article 1.0 Author-Name: Bo Zhou Author-X-Name-First: Bo Author-X-Name-Last: Zhou Author-Name: David E. Moorman Author-X-Name-First: David E. Author-X-Name-Last: Moorman Author-Name: Sam Behseta Author-X-Name-First: Sam Author-X-Name-Last: Behseta Author-Name: Hernando Ombao Author-X-Name-First: Hernando Author-X-Name-Last: Ombao Author-Name: Babak Shahbaba Author-X-Name-First: Babak Author-X-Name-Last: Shahbaba Title: A Dynamic Bayesian Model for Characterizing Cross-Neuronal Interactions During Decision-Making Abstract: The goal of this article is to develop a novel statistical model for studying cross-neuronal spike train interactions during decision-making.
For an individual to successfully complete the task of decision-making, a number of temporally organized events must occur: stimuli must be detected, potential outcomes must be evaluated, behaviors must be executed or inhibited, and outcomes (such as reward or no-reward) must be experienced. Due to the complexity of this process, it is likely that decision-making is encoded by the temporally precise interactions between large populations of neurons. Most existing statistical models, however, are inadequate for analyzing such a phenomenon because they provide only an aggregated measure of interactions over time. To address this considerable limitation, we propose a dynamic Bayesian model that captures the time-varying nature of neuronal activity (such as the time-varying strength of the interactions between neurons). The proposed method yielded results that reveal new insight into the dynamic nature of population coding in the prefrontal cortex during decision-making. In our analysis, we first note that while some neurons in the prefrontal cortex do not synchronize their firing activity until a reward is presented, a different set of neurons synchronizes their activity shortly after stimulus onset. These differentially synchronizing subpopulations of neurons suggest a continuum of population representation of the reward-seeking task. Second, our analyses also suggest that the degree of synchronization differs between the rewarded and nonrewarded conditions. Moreover, the proposed model is scalable to handle data on many simultaneously recorded neurons and is applicable to analyzing other types of multivariate time series data with latent structure. Supplementary materials (including computer codes) for our article are available online. Journal: Journal of the American Statistical Association Pages: 459-471 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1116988 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1116988 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:459-471 Template-Type: ReDIF-Article 1.0 Author-Name: Jonathan R. Bradley Author-X-Name-First: Jonathan R. Author-X-Name-Last: Bradley Author-Name: Christopher K. Wikle Author-X-Name-First: Christopher K. Author-X-Name-Last: Wikle Author-Name: Scott H. Holan Author-X-Name-First: Scott H. Author-X-Name-Last: Holan Title: Bayesian Spatial Change of Support for Count-Valued Survey Data With Application to the American Community Survey Abstract: We introduce Bayesian spatial change of support (COS) methodology for count-valued survey data with known survey variances. Our proposed methodology is motivated by the American Community Survey (ACS), an ongoing survey administered by the U.S. Census Bureau that provides timely information on several key demographic variables. Specifically, the ACS produces 1-year, 3-year, and 5-year “period-estimates,” and corresponding margins of error, for published demographic and socio-economic variables recorded over predefined geographies within the United States. Despite the availability of these predefined geographies, it is often of interest to data users to specify customized user-defined spatial supports. In particular, it is useful to estimate demographic variables defined on “new” spatial supports in “real-time.” This problem is known as spatial COS, which is typically performed under the assumption that the data follow a Gaussian distribution.
However, count-valued survey data are naturally non-Gaussian and, hence, we consider modeling these data using a Poisson distribution. Additionally, survey data are often accompanied by estimates of error, which we incorporate into our analysis. We interpret Poisson count-valued data in small areas as an aggregation of events from a spatial point process. This approach provides us with the flexibility necessary to allow ACS users to consider a variety of spatial supports in “real-time.” We show the effectiveness of our approach through a simulated example as well as through an analysis using public-use ACS data. Journal: Journal of the American Statistical Association Pages: 472-487 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1117471 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1117471 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:472-487 Template-Type: ReDIF-Article 1.0 Author-Name: Angela Noufaily Author-X-Name-First: Angela Author-X-Name-Last: Noufaily Author-Name: Paddy Farrington Author-X-Name-First: Paddy Author-X-Name-Last: Farrington Author-Name: Paul Garthwaite Author-X-Name-First: Paul Author-X-Name-Last: Garthwaite Author-Name: Doyo Gragn Enki Author-X-Name-First: Doyo Gragn Author-X-Name-Last: Enki Author-Name: Nick Andrews Author-X-Name-First: Nick Author-X-Name-Last: Andrews Author-Name: Andre Charlett Author-X-Name-First: Andre Author-X-Name-Last: Charlett Title: Detection of Infectious Disease Outbreaks From Laboratory Data With Reporting Delays Abstract: Many statistical surveillance systems for the timely detection of outbreaks of infectious disease operate on laboratory data. Such data typically incur reporting delays between the time at which a specimen is collected for diagnostic purposes, and the time at which the results of the laboratory analysis become available. Statistical surveillance systems currently in use usually make some ad hoc adjustment for such delays, or use counts by time of report. We propose a new statistical approach that takes account of the delays explicitly, by monitoring the number of specimens identified in the current and past m time units, where m is a tuning parameter. Values expected in the absence of an outbreak are estimated from counts observed in recent years (typically 5 years). We study the method in the context of an outbreak detection system used in the United Kingdom and several other European countries. We propose a suitable test statistic for the null hypothesis that no outbreak is currently occurring. We derive its null variance, incorporating uncertainty about the estimated delay distribution. Simulations and applications to some test datasets suggest the method works well, and can improve performance over ad hoc methods in current use. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 488-499 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1119047 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1119047 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:488-499 Template-Type: ReDIF-Article 1.0 Author-Name: Matthew Plumlee Author-X-Name-First: Matthew Author-X-Name-Last: Plumlee Author-Name: V. Roshan Joseph Author-X-Name-First: V.
Roshan Author-X-Name-Last: Joseph Author-Name: Hui Yang Author-X-Name-First: Hui Author-X-Name-Last: Yang Title: Calibrating Functional Parameters in the Ion Channel Models of Cardiac Cells Abstract: Computational modeling is a popular tool to understand a diverse set of complex systems. The output from a computational model depends on a set of parameters that are unknown to the designer, but a modeler can estimate them by collecting physical data. In our study of the ion channels of ventricular myocytes, the parameter of interest is a function as opposed to a scalar or a set of scalars. This article develops a new modeling strategy to nonparametrically study the functional parameter using Bayesian inference with Gaussian process prior distributions. A new sampling scheme is devised to address this unique problem. Journal: Journal of the American Statistical Association Pages: 500-509 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1119695 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1119695 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:500-509 Template-Type: ReDIF-Article 1.0 Author-Name: Laura Forastiere Author-X-Name-First: Laura Author-X-Name-Last: Forastiere Author-Name: Fabrizia Mealli Author-X-Name-First: Fabrizia Author-X-Name-Last: Mealli Author-Name: Tyler J. VanderWeele Author-X-Name-First: Tyler J. Author-X-Name-Last: VanderWeele Title: Identification and Estimation of Causal Mechanisms in Clustered Encouragement Designs: Disentangling Bed Nets Using Bayesian Principal Stratification Abstract: Exploration of causal mechanisms is often important for researchers and policymakers to understand how an intervention works and how it can be improved. This task can be crucial in clustered encouragement designs (CEDs). Encouragement design studies arise frequently when the treatment cannot be enforced because of ethical or practical constraints, and an encouragement intervention (information campaigns, incentives, etc.) is conceived with the purpose of increasing the uptake of the treatment of interest. By design, encouragements always entail the complication of noncompliance. Encouragements can also give rise to a variety of mechanisms, particularly when encouragement is assigned at the cluster level. Social interactions among units within the same cluster can result in spillover effects. Disentangling the effect of encouragement through spillover effects from that through the enhancement of the treatment would give better insight into the intervention, and it could be compelling for planning the scaling-up phase of the program. Building on previous work on CEDs and noncompliance, we use the principal stratification framework to define stratum-specific causal effects, that is, effects for specific latent subpopulations, defined by the joint potential compliance statuses under both encouragement conditions. We show how the latter stratum-specific causal effects are related to the decomposition commonly used in the literature and provide flexible homogeneity assumptions under which an extrapolation across principal strata allows one to disentangle the effects. Estimation of causal estimands can be performed with Bayesian inferential methods using hierarchical models to account for clustering.
We illustrate the proposed methodology by analyzing a cluster randomized experiment implemented in Zambia and designed to evaluate the impact on malaria prevalence of an agricultural loan program intended to increase bed net coverage. Farmer households assigned to the program could take advantage of a deferred payment and a discount in the purchase of new bed nets. Our analysis shows a lack of evidence of an effect of offering the program to a cluster of households through spillover effects, that is, through greater bed net coverage in the neighborhood. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 510-525 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1125788 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1125788 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:510-525 Template-Type: ReDIF-Article 1.0 Author-Name: Damião Nóbrega Da Silva Author-X-Name-First: Damião Nóbrega Author-X-Name-Last: Da Silva Author-Name: Chris Skinner Author-X-Name-First: Chris Author-X-Name-Last: Skinner Author-Name: Jae Kwang Kim Author-X-Name-First: Jae Kwang Author-X-Name-Last: Kim Title: Using Binary Paradata to Correct for Measurement Error in Survey Data Analysis Abstract: Paradata refers here to data at unit level on an observed auxiliary variable, not usually of direct scientific interest, which may be informative about the quality of the survey data for the unit. There is increasing interest among survey researchers in how to use such data. Its use to reduce bias from nonresponse has received more attention so far than its use to correct for measurement error. This article considers the latter with a focus on binary paradata indicating the presence of measurement error. A motivating application concerns inference about a regression model, where earnings is a covariate measured with error and whether a respondent refers to pay records is the paradata variable. We specify a parametric model allowing for either normally or t-distributed measurement errors and discuss the assumptions required to identify the regression coefficients. We propose two estimation approaches that take account of complex survey designs: pseudo-maximum likelihood estimation and parametric fractional imputation. These approaches are assessed in a simulation study and are applied to a regression of a measure of deprivation given earnings and other covariates using British Household Panel Survey data. It is found that the proposed approach to correcting for measurement error reduces bias and improves on the precision of a simple approach based on accurate observations. We briefly outline possible extensions to uses of this approach at earlier stages in the survey process. Supplemental materials are available online. Journal: Journal of the American Statistical Association Pages: 526-537 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1130632 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1130632 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:526-537 Template-Type: ReDIF-Article 1.0 Author-Name: Wei Liu Author-X-Name-First: Wei Author-X-Name-Last: Liu Author-Name: Zhiwei Zhang Author-X-Name-First: Zhiwei Author-X-Name-Last: Zhang Author-Name: R. Jason Schroeder Author-X-Name-First: R.
Jason Author-X-Name-Last: Schroeder Author-Name: Martin Ho Author-X-Name-First: Martin Author-X-Name-Last: Ho Author-Name: Bo Zhang Author-X-Name-First: Bo Author-X-Name-Last: Zhang Author-Name: Cynthia Long Author-X-Name-First: Cynthia Author-X-Name-Last: Long Author-Name: Hui Zhang Author-X-Name-First: Hui Author-X-Name-Last: Zhang Author-Name: Telba Z. Irony Author-X-Name-First: Telba Z. Author-X-Name-Last: Irony Title: Joint Estimation of Treatment and Placebo Effects in Clinical Trials With Longitudinal Blinding Assessments Abstract: In some therapeutic areas, treatment evaluation is frequently complicated by a possible placebo effect (i.e., the psychobiological effect of a patient's knowledge or belief of being treated). When a substantial placebo effect is likely to exist, it is important to distinguish the treatment and placebo effects in quantifying the clinical benefit of a new treatment. These causal effects can be formally defined in a joint causal model that includes treatment (e.g., new vs. placebo) and treatmentality (i.e., a patient's belief or mentality about which treatment she or he has received) as separate exposures. Information about the treatmentality exposure can be obtained from blinding assessments, which are increasingly common in clinical trials where blinding success is in question. Assuming that treatmentality has a lagged effect and is measured at multiple time points, this article is concerned with joint evaluation of treatment and placebo effects in clinical trials with longitudinal follow-up, possibly with monotone missing data. We describe and discuss several methods adapted from the longitudinal causal inference literature, apply them to a weight loss study, and compare them in simulation experiments that mimic the weight loss study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 538-548 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1130633 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1130633 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:538-548 Template-Type: ReDIF-Article 1.0 Author-Name: Zhe Yu Author-X-Name-First: Zhe Author-X-Name-Last: Yu Author-Name: Raquel Prado Author-X-Name-First: Raquel Author-X-Name-Last: Prado Author-Name: Erin Burke Quinlan Author-X-Name-First: Erin Burke Author-X-Name-Last: Quinlan Author-Name: Steven C. Cramer Author-X-Name-First: Steven C. Author-X-Name-Last: Cramer Author-Name: Hernando Ombao Author-X-Name-First: Hernando Author-X-Name-Last: Ombao Title: Understanding the Impact of Stroke on Brain Motor Function: A Hierarchical Bayesian Approach Abstract: Stroke is a disturbance in blood supply to the brain resulting in the loss of brain functions, particularly motor function. A study was conducted by the UCI Neurorehabilitation Lab to investigate the impact of stroke on motor-related brain regions. Functional MRI (fMRI) data were collected from stroke patients and healthy controls while the subjects performed a simple motor task. In addition to affecting local neuronal activation strength, stroke might also alter communications (i.e., connectivity) between brain regions. We develop a hierarchical Bayesian modeling approach for the analysis of multi-subject fMRI data that allows us to explore brain changes due to stroke. 
Our approach simultaneously estimates activation and condition-specific connectivity at the group level, and provides estimates for region/subject-specific hemodynamic response functions. Moreover, our model uses spike-and-slab priors to allow for direct posterior inference on the connectivity network. Our results indicate that motor-control regions show greater activation in the unaffected hemisphere and the midline surface in stroke patients than in healthy controls during the simple motor task. We also note increased connectivity within secondary motor regions in stroke subjects. These findings provide insight into altered neural correlates of movement in subjects who suffered a stroke. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 549-563 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1133425 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1133425 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:549-563 Template-Type: ReDIF-Article 1.0 Author-Name: Eric W. Fox Author-X-Name-First: Eric W. Author-X-Name-Last: Fox Author-Name: Martin B. Short Author-X-Name-First: Martin B. Author-X-Name-Last: Short Author-Name: Frederic P. Schoenberg Author-X-Name-First: Frederic P. Author-X-Name-Last: Schoenberg Author-Name: Kathryn D. Coronges Author-X-Name-First: Kathryn D. Author-X-Name-Last: Coronges Author-Name: Andrea L. Bertozzi Author-X-Name-First: Andrea L. Author-X-Name-Last: Bertozzi Title: Modeling E-mail Networks and Inferring Leadership Using Self-Exciting Point Processes Abstract: We propose various self-exciting point process models for the times when e-mails are sent between individuals in a social network. Using an expectation–maximization (EM)-type approach, we fit these models to an e-mail network dataset from West Point Military Academy and the Enron e-mail dataset. We argue that the self-exciting models adequately capture major temporal clustering features in the data and perform better than traditional stationary Poisson models. We also investigate how accounting for diurnal and weekly trends in e-mail activity improves the overall fit to the observed network data. A motivation and application for fitting these self-exciting models is to use parameter estimates to characterize important e-mail communication behaviors such as the baseline sending rates, average reply rates, and average response times. A primary goal is to use these features, estimated from the self-exciting models, to infer the underlying leadership status of users in the West Point and Enron networks. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 564-584 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1135802 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1135802 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
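As an illustrative aside to the preceding abstract, a minimal univariate self-exciting (Hawkes) process with an exponential triggering kernel can be written down and simulated by Ogata's thinning algorithm; the baseline rate mu, branching ratio alpha, and decay beta below are invented for illustration, and the article's multivariate e-mail models are substantially richer.

```python
import numpy as np

def hawkes_intensity(t, event_times, mu=0.5, alpha=0.8, beta=1.2):
    """Conditional intensity lambda(t) = mu + sum_{t_i < t} alpha*beta*exp(-beta*(t - t_i))."""
    past = event_times[event_times < t]
    return mu + alpha * beta * np.exp(-beta * (t - past)).sum()

def simulate_hawkes(T, mu=0.5, alpha=0.8, beta=1.2, seed=0):
    """Simulate event times on [0, T] by Ogata's thinning algorithm."""
    rng = np.random.default_rng(seed)
    events, t = [], 0.0
    while t < T:
        # The intensity decays between events, so the value just after time t
        # (plus alpha*beta for a possible event at exactly t) dominates it.
        lam_bar = hawkes_intensity(t, np.array(events), mu, alpha, beta) + alpha * beta
        t += rng.exponential(1.0 / lam_bar)
        if t < T and rng.uniform() * lam_bar <= hawkes_intensity(t, np.array(events), mu, alpha, beta):
            events.append(t)  # accept candidate with probability lambda(t)/lam_bar
    return np.array(events)

print(len(simulate_hawkes(100.0)))  # number of simulated "e-mail" times
```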
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:564-584 Template-Type: ReDIF-Article 1.0 Author-Name: Peter Goos Author-X-Name-First: Peter Author-X-Name-Last: Goos Author-Name: Bradley Jones Author-X-Name-First: Bradley Author-X-Name-Last: Jones Author-Name: Utami Syafitri Author-X-Name-First: Utami Author-X-Name-Last: Syafitri Title: I-Optimal Design of Mixture Experiments Abstract: In mixture experiments, the factors under study are proportions of the ingredients of a mixture. The special nature of the factors necessitates specific types of regression models, and specific types of experimental designs. Although mixture experiments usually are intended to predict the response(s) for all possible formulations of the mixture and to identify optimal proportions for each of the ingredients, little research has been done concerning their I-optimal design. This is surprising given that I-optimal designs minimize the average variance of prediction and, therefore, seem more appropriate for mixture experiments than the commonly used D-optimal designs, which focus on precise model estimation rather than precise predictions. In this article, we provide the first detailed overview of the literature on the I-optimal design of mixture experiments and identify several contradictions. For the second-order and the special cubic model, we present continuous I-optimal designs and contrast them with the published results. We also study exact I-optimal designs, and compare them in detail to continuous I-optimal designs and to D-optimal designs. One striking result of our work is that the performance of D-optimal designs in terms of the I-optimality criterion very strongly depends on which of the D-optimal designs is considered. Supplemental materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 899-911 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2015.1136632 File-URL: http://hdl.handle.net/10.1080/01621459.2015.1136632 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:899-911 Template-Type: ReDIF-Article 1.0 Author-Name: Daniel Cervone Author-X-Name-First: Daniel Author-X-Name-Last: Cervone Author-Name: Alex D’Amour Author-X-Name-First: Alex Author-X-Name-Last: D’Amour Author-Name: Luke Bornn Author-X-Name-First: Luke Author-X-Name-Last: Bornn Author-Name: Kirk Goldsberry Author-X-Name-First: Kirk Author-X-Name-Last: Goldsberry Title: A Multiresolution Stochastic Process Model for Predicting Basketball Possession Outcomes Abstract: Basketball games evolve continuously in space and time as players constantly interact with their teammates, the opposing team, and the ball. However, current analyses of basketball outcomes rely on discretized summaries of the game that reduce such interactions to tallies of points, assists, and similar events. In this article, we propose a framework for using optical player tracking data to estimate, in real time, the expected number of points obtained by the end of a possession. This quantity, called expected possession value (EPV), derives from a stochastic process model for the evolution of a basketball possession. We model this process at multiple levels of resolution, differentiating between continuous, infinitesimal movements of players, and discrete events such as shot attempts and turnovers.
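As a toy illustration of the EPV idea (not the authors' multiresolution model, and with invented numbers), a possession can be coarsened to an absorbing Markov chain over hypothetical ball states, with the expected points from each state obtained by solving a small linear system:

```python
import numpy as np

# Transient states: 0 = ball on perimeter, 1 = ball in paint (hypothetical coarsening).
Q = np.array([[0.60, 0.15],    # perimeter -> {perimeter, paint}
              [0.20, 0.50]])   # paint     -> {perimeter, paint}
# Absorbing outcomes: made 3, made 2, no points; each row of [Q R] sums to 1.
R = np.array([[0.10, 0.05, 0.10],
              [0.01, 0.19, 0.10]])
points = np.array([3.0, 2.0, 0.0])

# EPV per transient state solves v = Q v + R @ points.
epv = np.linalg.solve(np.eye(2) - Q, R @ points)
print({"perimeter": epv[0], "paint": epv[1]})
```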
Transition kernels are estimated using hierarchical spatiotemporal models that share information across players while remaining computationally tractable on very large data sets. In addition to estimating EPV, these models reveal novel insights on players’ decision-making tendencies as a function of their spatial strategy. In the supplementary material, we provide a data sample and R code for further exploration of our model and its results. Journal: Journal of the American Statistical Association Pages: 585-599 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2016.1141685 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1141685 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:585-599 Template-Type: ReDIF-Article 1.0 Author-Name: Ryan J. Tibshirani Author-X-Name-First: Ryan J. Author-X-Name-Last: Tibshirani Author-Name: Jonathan Taylor Author-X-Name-First: Jonathan Author-X-Name-Last: Taylor Author-Name: Richard Lockhart Author-X-Name-First: Richard Author-X-Name-Last: Lockhart Author-Name: Robert Tibshirani Author-X-Name-First: Robert Author-X-Name-Last: Tibshirani Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 618-620 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2016.1182787 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1182787 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:618-620 Template-Type: ReDIF-Article 1.0 Author-Name: Lawrence D. Brown Author-X-Name-First: Lawrence D. Author-X-Name-Last: Brown Author-Name: Kory D. Johnson Author-X-Name-First: Kory D. Author-X-Name-Last: Johnson Title: Comment Journal: Journal of the American Statistical Association Pages: 614-617 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2016.1182788 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1182788 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:614-617 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Book Reviews Journal: Journal of the American Statistical Association Pages: 912-919 Issue: 514 Volume: 111 Year: 2016 Month: 4 X-DOI: 10.1080/01621459.2016.1200851 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1200851 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:912-919 Template-Type: ReDIF-Article 1.0 Author-Name: Mauricio Sadinle Author-X-Name-First: Mauricio Author-X-Name-Last: Sadinle Title: Bayesian Estimation of Bipartite Matchings for Record Linkage Abstract: The bipartite record linkage task consists of merging two disparate datafiles containing information on two overlapping sets of entities. This is nontrivial in the absence of unique identifiers and it is important for a wide variety of applications given that it needs to be solved whenever we have to combine information from different sources. Most statistical techniques currently used for record linkage are derived from a seminal article by Fellegi and Sunter in 1969. These techniques usually assume independence in the matching statuses of record pairs to derive estimation procedures and optimal point estimators. 
We argue that this independence assumption is unreasonable and instead target a bipartite matching between the two datafiles as our parameter of interest. Bayesian implementations allow us to quantify uncertainty on the matching decisions and derive a variety of point estimators using different loss functions. We propose partial Bayes estimates that allow uncertain parts of the bipartite matching to be left unresolved. We evaluate our approach to record linkage using a variety of challenging scenarios and show that it outperforms the traditional methodology. We illustrate the advantages of our methods by merging two datafiles on casualties from the civil war of El Salvador. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 600-612 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1148612 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1148612 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:600-612 Template-Type: ReDIF-Article 1.0 Author-Name: Yijian Huang Author-X-Name-First: Yijian Author-X-Name-Last: Huang Title: Restoration of Monotonicity Respecting in Dynamic Regression Abstract: Dynamic regression models, including the quantile regression model and Aalen’s additive hazards model, are widely adopted to investigate evolving covariate effects. Yet the lack of monotonicity respecting with standard estimation procedures remains an outstanding issue. Advances have recently been made, but none provides a complete resolution. In this article, we propose a novel adaptive interpolation method to restore monotonicity respecting, by successively identifying and then interpolating nearest monotonicity-respecting points of an original estimator. Under mild regularity conditions, the resulting regression coefficient estimator is shown to be asymptotically equivalent to the original. Our numerical studies have demonstrated that the proposed estimator is much smoother and may have better finite-sample efficiency than the original as well as, when available (only in special cases), other competing monotonicity-respecting estimators. Illustration with a clinical study is provided. Journal: Journal of the American Statistical Association Pages: 613-622 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1149070 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1149070 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:613-622 Template-Type: ReDIF-Article 1.0 Author-Name: David S. Matteson Author-X-Name-First: David S. Author-X-Name-Last: Matteson Author-Name: Ruey S. Tsay Author-X-Name-First: Ruey S. Author-X-Name-Last: Tsay Title: Independent Component Analysis via Distance Covariance Abstract: This article introduces a novel statistical framework for independent component analysis (ICA) of multivariate data. We propose methodology for estimating mutually independent components, and a versatile resampling-based procedure for inference, including misspecification testing. Independent components are estimated by combining a nonparametric probability integral transformation with a generalized nonparametric whitening method based on distance covariance that simultaneously minimizes all forms of dependence among the components.
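For reference, the sample distance covariance that drives this whitening step has a short closed form based on double-centered pairwise distance matrices; a minimal univariate sketch (V-statistic version) follows.

```python
import numpy as np

def dcov(x, y):
    """Sample distance covariance (Szekely-Rizzo V-statistic form); its
    population version is zero if and only if x and y are independent."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])                  # pairwise distances
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()    # double centering
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    return np.sqrt(max((A * B).mean(), 0.0))

rng = np.random.default_rng(1)
x = rng.normal(size=500)
print(dcov(x, x**2), dcov(x, rng.normal(size=500)))  # dependent vs. independent
```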
We prove the consistency of our estimator under minimal regularity conditions and detail conditions for consistency under model misspecification, all while placing assumptions on the observations directly, not on the latent components. U-statistics of certain Euclidean distances between sample elements are combined to construct a test statistic for mutually independent components. The proposed measures and tests are based on both necessary and sufficient conditions for mutual independence. We demonstrate the improvements of the proposed method over several competing methods in simulation studies, and we apply the proposed ICA approach to two real examples and contrast it with principal component analysis. Journal: Journal of the American Statistical Association Pages: 623-637 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1150851 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1150851 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:623-637 Template-Type: ReDIF-Article 1.0 Author-Name: Kristin A. Linn Author-X-Name-First: Kristin A. Author-X-Name-Last: Linn Author-Name: Eric B. Laber Author-X-Name-First: Eric B. Author-X-Name-Last: Laber Author-Name: Leonard A. Stefanski Author-X-Name-First: Leonard A. Author-X-Name-Last: Stefanski Title: Interactive Q-Learning for Quantiles Abstract: A dynamic treatment regime is a sequence of decision rules, each of which recommends treatment based on features of patient medical history such as past treatments and outcomes. Existing methods for estimating optimal dynamic treatment regimes from data optimize the mean of a response variable. However, the mean may not always be the most appropriate summary of performance. We derive estimators of decision rules for optimizing probabilities and quantiles computed with respect to the response distribution for two-stage, binary treatment settings. This enables estimation of dynamic treatment regimes that optimize the cumulative distribution function of the response at a prespecified point or a prespecified quantile of the response distribution such as the median. The proposed methods perform favorably in simulation experiments. We illustrate our approach with data from a sequentially randomized trial where the primary outcome is remission of depression symptoms. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 638-649 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1155993 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1155993 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:638-649 Template-Type: ReDIF-Article 1.0 Author-Name: Shujie Ma Author-X-Name-First: Shujie Author-X-Name-Last: Ma Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Author-Name: Chih-Ling Tsai Author-X-Name-First: Chih-Ling Author-X-Name-Last: Tsai Title: Variable Screening via Quantile Partial Correlation Abstract: In quantile linear regression with ultrahigh-dimensional data, we propose an algorithm for screening all candidate variables and subsequently selecting relevant predictors. Specifically, we first employ quantile partial correlation for screening, and then we apply the extended Bayesian information criterion (EBIC) for best subset selection.
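A hedged sketch of the EBIC step for one candidate subset is given below; the criterion form n log(mean check loss) + |s|(log n + 2 gamma log p) and the use of statsmodels' QuantReg are illustrative assumptions and may differ in detail from the authors' exact criterion.

```python
import numpy as np
import statsmodels.api as sm

def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def ebic_quantreg(y, X, subset, tau=0.5, gamma=0.5):
    """Illustrative EBIC score for a candidate subset under quantile regression."""
    n, p = X.shape
    Xs = sm.add_constant(X[:, list(subset)])
    fit = sm.QuantReg(y, Xs).fit(q=tau)
    loss = check_loss(y - fit.fittedvalues, tau).mean()
    return n * np.log(loss) + len(subset) * (np.log(n) + 2 * gamma * np.log(p))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=200)
print(ebic_quantreg(y, X, (0, 2)), ebic_quantreg(y, X, (0, 5)))  # true vs. wrong subset
```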
Our proposed method can successfully select predictors when the variables are highly correlated, and it can also identify variables that make a contribution to the conditional quantiles but are marginally uncorrelated or weakly correlated with the response. Theoretical results show that the proposed algorithm can yield the sure screening set. By controlling the false selection rate, model selection consistency can be achieved theoretically. In practice, we propose using EBIC for best subset selection so that the resulting model is screening consistent. Simulation studies demonstrate that the proposed algorithm performs well, and an empirical example is presented. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 650-663 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1156545 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1156545 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:650-663 Template-Type: ReDIF-Article 1.0 Author-Name: Qingning Zhou Author-X-Name-First: Qingning Author-X-Name-Last: Zhou Author-Name: Tao Hu Author-X-Name-First: Tao Author-X-Name-Last: Hu Author-Name: Jianguo Sun Author-X-Name-First: Jianguo Author-X-Name-Last: Sun Title: A Sieve Semiparametric Maximum Likelihood Approach for Regression Analysis of Bivariate Interval-Censored Failure Time Data Abstract: Interval-censored failure time data arise in a number of fields and many authors have discussed various issues related to their analysis. However, most of the existing methods are for univariate data and there exists only limited research on bivariate data, especially on regression analysis of bivariate interval-censored data. We present a class of semiparametric transformation models for the problem and for inference, a sieve maximum likelihood approach is developed. The model provides great flexibility, in particular including the commonly used proportional hazards model as a special case, and in the approach, Bernstein polynomials are employed. The strong consistency and asymptotic normality of the resulting estimators of regression parameters are established and furthermore, the estimators are shown to be asymptotically efficient. Extensive simulation studies are conducted and indicate that the proposed method works well for practical situations. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 664-672 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1158113 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1158113 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:664-672 Template-Type: ReDIF-Article 1.0 Author-Name: Vikram V. Garg Author-X-Name-First: Vikram V. Author-X-Name-Last: Garg Author-Name: Roy H. Stogner Author-X-Name-First: Roy H. Author-X-Name-Last: Stogner Title: Hierarchical Latin Hypercube Sampling Abstract: Latin hypercube sampling (LHS) is a robust, scalable Monte Carlo method that is used in many areas of science and engineering. We present a new algorithm for generating hierarchic Latin hypercube sets (HLHS) that are recursively divisible into LHS subsets.
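For contrast with the hierarchical construction, a plain (non-recursive) Latin hypercube draw takes only a few lines; the recursive divisibility into LHS subsets is exactly what this standard scheme lacks.

```python
import numpy as np

def latin_hypercube(n, d, seed=None):
    """Standard LHS on [0, 1]^d: one point per equal-width stratum in each
    dimension, with independent random stratum permutations across dimensions."""
    rng = np.random.default_rng(seed)
    jitter = rng.uniform(size=(n, d))                     # position within stratum
    perms = np.argsort(rng.uniform(size=(n, d)), axis=0)  # random permutations of 0..n-1
    return (perms + jitter) / n

print(latin_hypercube(5, 2, seed=0))
```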
Based on this new construction, we introduce a hierarchical incremental LHS (HILHS) method that allows the user to employ LHS in a flexibly incremental setting. This overcomes a drawback of many LHS schemes that require the entire sample set to be selected a priori, or only allow very large increments. We derive the sampling properties for HLHS designs and HILHS estimators. We also present numerical studies that showcase the flexible incrementation offered by HILHS. Journal: Journal of the American Statistical Association Pages: 673-682 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1158717 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1158717 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:673-682 Template-Type: ReDIF-Article 1.0 Author-Name: Fasheng Sun Author-X-Name-First: Fasheng Author-X-Name-Last: Sun Author-Name: Boxin Tang Author-X-Name-First: Boxin Author-X-Name-Last: Tang Title: A Method of Constructing Space-Filling Orthogonal Designs Abstract: This article presents a method of constructing a rich class of orthogonal designs that include orthogonal Latin hypercubes as special cases. Two prominent features of the method are its simplicity and generality. In addition to orthogonality, the resulting designs enjoy some attractive space-filling properties, making them very suitable for computer experiments. Journal: Journal of the American Statistical Association Pages: 683-689 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1159211 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1159211 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:683-689 Template-Type: ReDIF-Article 1.0 Author-Name: Ruiyan Luo Author-X-Name-First: Ruiyan Author-X-Name-Last: Luo Author-Name: Xin Qi Author-X-Name-First: Xin Author-X-Name-Last: Qi Title: Function-on-Function Linear Regression by Signal Compression Abstract: We consider functional linear regression models with a functional response and multiple functional predictors, with the goal of finding the best finite-dimensional approximation to the signal part of the response function. Defining the integrated squared correlation coefficient between a random variable and a random function, we propose to solve a penalized generalized functional eigenvalue problem, whose solutions are such that projections on the original predictors generate new uncorrelated scalar variables, and these variables have the largest integrated squared correlation coefficient with the signal function. With these new variables, we transform the original function-on-function regression model to a function-on-scalar regression model whose predictors are uncorrelated, and estimate the model by a penalized least-squares method. This method is also extended to models with both multiple functional and scalar predictors. We provide the asymptotic consistency and the corresponding convergence rates for our estimates. Simulation studies in various settings and for both one and multiple functional predictors demonstrate that our approach has good predictive performance and is very computationally efficient. Supplementary materials for this article are available online.
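As a naive grid-based point of comparison (not the authors' signal-compression estimator), a function-on-function linear model can be fit by discretizing the integral and applying a ridge penalty; the grid sizes and penalty below are arbitrary choices for illustration.

```python
import numpy as np

def fof_ridge(Xg, Yg, ds, lam=1.0):
    """Naive estimate of beta(s, t) in Y_i(t) ~ int X_i(s) beta(s, t) ds:
    quadrature-weight the design, then solve ridge-penalized least squares."""
    Z = Xg * ds                                     # n x p, quadrature-weighted
    beta = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ Yg)
    return beta                                     # p x q surface on the grid

rng = np.random.default_rng(2)
s = np.linspace(0, np.pi, 50)
beta_true = np.outer(np.sin(s), np.cos(s))          # hypothetical coefficient surface
Xg = rng.normal(size=(100, 50))
Yg = (Xg * (1 / 50)) @ beta_true + 0.1 * rng.normal(size=(100, 50))
print(fof_ridge(Xg, Yg, ds=1 / 50).shape)           # (50, 50)
```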
Journal: Journal of the American Statistical Association Pages: 690-705 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1164053 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1164053 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:690-705 Template-Type: ReDIF-Article 1.0 Author-Name: Alexander Schnurr Author-X-Name-First: Alexander Author-X-Name-Last: Schnurr Author-Name: Herold Dehling Author-X-Name-First: Herold Author-X-Name-Last: Dehling Title: Testing for Structural Breaks via Ordinal Pattern Dependence Abstract: We propose new concepts to analyze and model the dependence structure between two time series. Our methods rely exclusively on the order structure of the data points. Hence, the methods are stable under monotone transformations of the time series and robust against small perturbations or measurement errors. Ordinal pattern dependence can be characterized by four parameters. We propose estimators for these parameters, and we calculate their asymptotic distributions. Furthermore, we derive a test for structural breaks within the dependence structure. All results are supplemented by simulation studies and empirical examples. For three consecutive data points attaining different values, there are six possibilities for how their values can be ordered. These possibilities are called ordinal patterns. Our first idea is simply to count the number of coincidences of patterns in both time series and to compare this with the expected number in the case of independence. If we detect many coincident patterns, this indicates that the up-and-down behavior is similar. Hence, our concept can be seen as a way to measure nonlinear “correlation.” We show in the last section how to generalize the concept to capture various other kinds of dependence. Journal: Journal of the American Statistical Association Pages: 706-720 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1164706 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1164706 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:706-720 Template-Type: ReDIF-Article 1.0 Author-Name: David B. Dahl Author-X-Name-First: David B. Author-X-Name-Last: Dahl Author-Name: Ryan Day Author-X-Name-First: Ryan Author-X-Name-Last: Day Author-Name: Jerry W. Tsai Author-X-Name-First: Jerry W. Author-X-Name-Last: Tsai Title: Random Partition Distribution Indexed by Pairwise Information Abstract: We propose a random partition distribution indexed by pairwise similarity information such that partitions compatible with the similarities are given more probability. The use of pairwise similarities, in the form of distances, is common in some clustering algorithms (e.g., hierarchical clustering), but we show how to use this type of information to define a prior partition distribution for flexible Bayesian modeling. A defining feature of the distribution is that it allocates probability among partitions within a given number of subsets, but it does not shift probability among sets of partitions with different numbers of subsets. Our distribution places more probability on partitions that group similar items yet keeps the total probability of partitions with a given number of subsets constant. The distribution of the number of subsets (and its moments) is available in closed-form and is not a function of the similarities.
Our formulation has an explicit probability mass function (with a tractable normalizing constant) so the full suite of MCMC methods may be used for posterior inference. We compare our distribution with several existing partition distributions, showing that our formulation has attractive properties. We provide three demonstrations to highlight the features and relative performance of our distribution. Journal: Journal of the American Statistical Association Pages: 721-732 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1165103 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1165103 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:721-732 Template-Type: ReDIF-Article 1.0 Author-Name: Daniel R. Kowal Author-X-Name-First: Daniel R. Author-X-Name-Last: Kowal Author-Name: David S. Matteson Author-X-Name-First: David S. Author-X-Name-Last: Matteson Author-Name: David Ruppert Author-X-Name-First: David Author-X-Name-Last: Ruppert Title: A Bayesian Multivariate Functional Dynamic Linear Model Abstract: We present a Bayesian approach for modeling multivariate, dependent functional data. To account for the three dominant structural features in the data—functional, time dependent, and multivariate components—we extend hierarchical dynamic linear models for multivariate time series to the functional data setting. We also develop Bayesian spline theory in a more general constrained optimization framework. The proposed methods identify a time-invariant functional basis for the functional observations, which is smooth and interpretable, and can be made common across multivariate observations for additional information sharing. The Bayesian framework permits joint estimation of the model parameters, provides exact inference (up to MCMC error) on specific parameters, and allows generalized dependence structures. Sampling from the posterior distribution is accomplished with an efficient Gibbs sampling algorithm. We illustrate the proposed framework with two applications: (1) multi-economy yield curve data from the recent global recession, and (2) local field potential brain signals in rats, for which we develop a multivariate functional time series approach for multivariate time–frequency analysis. Supplementary materials, including R code and the multi-economy yield curve data, are available online. Journal: Journal of the American Statistical Association Pages: 733-744 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1165104 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1165104 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:733-744 Template-Type: ReDIF-Article 1.0 Author-Name: Yifei Sun Author-X-Name-First: Yifei Author-X-Name-Last: Sun Author-Name: Mei-Cheng Wang Author-X-Name-First: Mei-Cheng Author-X-Name-Last: Wang Title: Evaluating Utility Measurement From Recurrent Marker Processes in the Presence of Competing Terminal Events Abstract: In follow-up studies, utility marker measurements are usually collected upon the occurrence of recurrent events until a terminal event such as death takes place. In this article, we define the recurrent marker process to characterize utility accumulation over time. 
For example, with medical cost and repeated hospitalizations being treated as marker and recurrent events, respectively, the recurrent marker process is the trajectory of cumulative cost, which stops increasing after death. In many applications, competing risks arise as subjects are at risk of more than one mutually exclusive terminal event, such as death from different causes, and modeling the recurrent marker process for each failure type is often of interest. However, censoring creates challenges in the methodological development, because for censored subjects, both the failure type and the recurrent marker process after censoring are unobserved. To circumvent this problem, we propose a nonparametric framework for the recurrent marker process with competing terminal events. In the presence of competing risks, we start with an estimator using marker information from uncensored subjects. As a result, the estimator can be inefficient under heavy censoring. To improve efficiency, we propose a second estimator by combining the first estimator with auxiliary information from the estimate under the noncompeting risks model. The large sample properties and optimality of the second estimator are established. Simulation studies and an application to the SEER-Medicare linked data are presented to illustrate the proposed methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 745-756 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1166113 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1166113 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:745-756 Template-Type: ReDIF-Article 1.0 Author-Name: Xianyang Zhang Author-X-Name-First: Xianyang Author-X-Name-Last: Zhang Author-Name: Guang Cheng Author-X-Name-First: Guang Author-X-Name-Last: Cheng Title: Simultaneous Inference for High-Dimensional Linear Models Abstract: This article proposes a bootstrap-assisted procedure to conduct simultaneous inference for high-dimensional sparse linear models based on the recent desparsifying Lasso estimator. Our procedure allows the dimension of the parameter vector of interest to be exponentially larger than sample size, and it automatically accounts for the dependence within the desparsifying Lasso estimator. Moreover, our simultaneous testing method can be naturally coupled with the margin screening to enhance its power in sparse testing with a reduced computational cost, or with the step-down method to provide a strong control for the family-wise error rate. In theory, we prove that our simultaneous testing procedure asymptotically achieves the prespecified significance level, and enjoys certain optimality in terms of its power even when the model errors are non-Gaussian. Our general theory is also useful in studying the support recovery problem. To broaden the applicability, we further extend our main results to generalized linear models with convex loss functions. The effectiveness of our methods is demonstrated via simulation studies. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 757-768 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1166114 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1166114 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
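As an aside to the preceding abstract, the bootstrap-assisted max-type step can be caricatured with a generic Gaussian-multiplier bootstrap; the influence-matrix input below is an assumption standing in for the desparsified-lasso construction.

```python
import numpy as np

def max_test(influence, level=0.05, B=2000, seed=0):
    """Multiplier bootstrap for a max-type simultaneous test. `influence` is an
    n x p matrix whose column means are the test statistics and whose rows act
    as influence-function contributions. Returns (max statistic, critical value)."""
    rng = np.random.default_rng(seed)
    n, p = influence.shape
    T = np.abs(np.sqrt(n) * influence.mean(axis=0)).max()
    centered = influence - influence.mean(axis=0)
    boot = np.empty(B)
    for b in range(B):
        e = rng.normal(size=n)                        # Gaussian multipliers
        boot[b] = np.abs(centered.T @ e / np.sqrt(n)).max()
    return T, np.quantile(boot, 1 - level)

rng = np.random.default_rng(3)
T, crit = max_test(rng.normal(size=(300, 50)))
print(T, crit, T > crit)  # under the global null, rejection is rare
```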
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:757-768 Template-Type: ReDIF-Article 1.0 Author-Name: Ailin Fan Author-X-Name-First: Ailin Author-X-Name-Last: Fan Author-Name: Rui Song Author-X-Name-First: Rui Author-X-Name-Last: Song Author-Name: Wenbin Lu Author-X-Name-First: Wenbin Author-X-Name-Last: Lu Title: Change-Plane Analysis for Subgroup Detection and Sample Size Calculation Abstract: We propose a systematic method for testing and identifying a subgroup with an enhanced treatment effect. We adopt a change-plane technique to first test the existence of a subgroup, and then identify the subgroup if the null hypothesis of nonexistence of such a subgroup is rejected. A semiparametric model is considered for the response with an unspecified baseline function and an interaction between a subgroup indicator and treatment. A doubly robust test statistic is constructed based on this model, and asymptotic distributions of the test statistic under both null and local alternative hypotheses are derived. Moreover, a sample size calculation method for subgroup detection is developed based on the proposed statistic. The finite sample performance of the proposed test is evaluated via simulations. Finally, the proposed methods for subgroup identification and sample size calculation are applied to data from an AIDS study. Journal: Journal of the American Statistical Association Pages: 769-778 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1166115 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1166115 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:769-778 Template-Type: ReDIF-Article 1.0 Author-Name: Yang Ni Author-X-Name-First: Yang Author-X-Name-Last: Ni Author-Name: Francesco C. Stingo Author-X-Name-First: Francesco C. Author-X-Name-Last: Stingo Author-Name: Veerabhadran Baladandayuthapani Author-X-Name-First: Veerabhadran Author-X-Name-Last: Baladandayuthapani Title: Sparse Multi-Dimensional Graphical Models: A Unified Bayesian Framework Abstract: Multi-dimensional data constituted by measurements along multiple axes have emerged across many scientific areas such as genomics and cancer surveillance. A common objective is to investigate the conditional dependencies among the variables along each axis, taking into account the multi-dimensional structure of the data. Traditional multivariate approaches are unsuitable for such highly structured data due to inefficiency, loss of power, and lack of interpretability. In this article, we propose a novel class of multi-dimensional graphical models based on matrix decompositions of the precision matrices along each dimension. Our approach is a unified framework applicable to both directed and undirected decomposable graphs as well as arbitrary combinations of these. Exploiting the marginalization of the likelihood, we develop efficient posterior sampling schemes based on partially collapsed Gibbs samplers. Empirically, through simulation studies, we show the superior performance of our approach in comparison with those of benchmark and state-of-the-art methods. We illustrate our approaches using two datasets: ovarian cancer proteomics and U.S. cancer mortality. Supplementary materials for this article are available online.
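The matrix-decomposition idea can be pictured in its simplest two-way form with a Kronecker-structured joint precision; the small matrices below are hypothetical and ignore the directed/undirected distinctions that the article develops.

```python
import numpy as np

# Per-axis precisions for a p x q matrix-variate observation (hypothetical values).
Omega_row = np.array([[2.0, -0.8], [-0.8, 2.0]])
Omega_col = np.array([[1.5, 0.3], [0.3, 1.5]])

# Separable joint precision for the vectorized observation: zeros in a per-axis
# precision induce structured zeros (missing edges) in the joint graph.
Omega = np.kron(Omega_row, Omega_col)
print(Omega.shape, np.linalg.eigvalsh(Omega).min() > 0)  # (4, 4) True
```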
Journal: Journal of the American Statistical Association Pages: 779-793 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1167694 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1167694 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:779-793 Template-Type: ReDIF-Article 1.0 Author-Name: Gongjun Xu Author-X-Name-First: Gongjun Author-X-Name-Last: Xu Author-Name: Sy Han Chiou Author-X-Name-First: Sy Han Author-X-Name-Last: Chiou Author-Name: Chiung-Yu Huang Author-X-Name-First: Chiung-Yu Author-X-Name-Last: Huang Author-Name: Mei-Cheng Wang Author-X-Name-First: Mei-Cheng Author-X-Name-Last: Wang Author-Name: Jun Yan Author-X-Name-First: Jun Author-X-Name-Last: Yan Title: Joint Scale-Change Models for Recurrent Events and Failure Time Abstract: Recurrent event data arise frequently in various fields such as biomedical sciences, public health, engineering, and social sciences. In many instances, the observation of the recurrent event process can be stopped by the occurrence of a correlated failure event, such as treatment failure and death. In this article, we propose a joint scale-change model for the recurrent event process and the failure time, where a shared frailty variable is used to model the association between the two types of outcomes. In contrast to the popular Cox-type joint modeling approaches, the regression parameters in the proposed joint scale-change model have marginal interpretations. The proposed approach is robust in the sense that no parametric assumption is imposed on the distribution of the unobserved frailty and that we do not need the strong Poisson-type assumption for the recurrent event process. We establish consistency and asymptotic normality of the proposed semiparametric estimators under suitable regularity conditions. To estimate the corresponding variances of the estimators, we develop a computationally efficient resampling-based procedure. Simulation studies and an analysis of hospitalization data from the Danish Psychiatric Central Register illustrate the performance of the proposed method. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 794-805 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1173557 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1173557 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:794-805 Template-Type: ReDIF-Article 1.0 Author-Name: Andrés F. Barrientos Author-X-Name-First: Andrés F. Author-X-Name-Last: Barrientos Author-Name: Alejandro Jara Author-X-Name-First: Alejandro Author-X-Name-Last: Jara Author-Name: Fernando A. Quintana Author-X-Name-First: Fernando A. Author-X-Name-Last: Quintana Title: Fully Nonparametric Regression for Bounded Data Using Dependent Bernstein Polynomials Abstract: We propose a novel class of probability models for sets of predictor-dependent probability distributions with bounded domain. The proposal extends the Dirichlet–Bernstein prior for single density estimation, by using dependent stick-breaking processes. A general model class and two simplified versions are discussed in detail. Appealing theoretical properties such as continuity, association structure, marginal distribution, large support, and consistency of the posterior distribution are established for all models. 
The behavior of the models is illustrated using simulated and real-life data. The simulated data are also used to compare the proposed methodology to existing methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 806-825 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1180987 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1180987 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:806-825 Template-Type: ReDIF-Article 1.0 Author-Name: Yifei Sun Author-X-Name-First: Yifei Author-X-Name-Last: Sun Author-Name: Chiung-Yu Huang Author-X-Name-First: Chiung-Yu Author-X-Name-Last: Huang Author-Name: Mei-Cheng Wang Author-X-Name-First: Mei-Cheng Author-X-Name-Last: Wang Title: Nonparametric Benefit–Risk Assessment Using Marker Process in the Presence of a Terminal Event Abstract: Benefit–risk assessment is a crucial step in the medical decision process. In many biomedical studies, both longitudinal marker measurements and time to a terminal event serve as important endpoints for benefit–risk assessment. The effect of an intervention or a treatment on the longitudinal marker process, however, can be in conflict with its effect on the time to the terminal event. Thus, questions arise on how to evaluate treatment effects based on the two endpoints, for the purpose of deciding on which treatment is most likely to benefit the patients. In this article, we present a unified framework for benefit–risk assessment using the observed longitudinal markers and time to event data. We propose a cumulative weighted marker process to synthesize information from the two endpoints, and use its mean function at a prespecified time point as a benefit–risk summary measure. We consider nonparametric estimation of the summary measure under two scenarios: (i) the longitudinal marker is measured intermittently during the study period, and (ii) the value of the longitudinal marker is observed throughout the entire follow-up period. The large-sample properties of the estimators are derived and compared. Simulation studies and data examples show that the proposed methods are easy to implement and reliable for practical use. Supplemental materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 826-836 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1180988 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1180988 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:826-836 Template-Type: ReDIF-Article 1.0 Author-Name: Ang Li Author-X-Name-First: Ang Author-X-Name-Last: Li Author-Name: Rina Foygel Barber Author-X-Name-First: Rina Foygel Author-X-Name-Last: Barber Title: Accumulation Tests for FDR Control in Ordered Hypothesis Testing Abstract: Multiple testing problems arising in modern scientific applications can involve simultaneously testing thousands or even millions of hypotheses, with relatively few true signals. In this article, we consider the multiple testing problem where prior information is available (for instance, from an earlier study under different experimental conditions) that can allow us to test the hypotheses as a ranked list to increase the number of discoveries.
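Anticipating the cutoff rule described next, a generic accumulation test fits in a few lines; ForwardStop's accumulation function h(p) = log(1/(1 - p)) is shown as one member of the family (the article's HingeExp function is another).

```python
import numpy as np

def accumulation_test(pvals_ordered, h, alpha=0.1):
    """Return the largest cutoff k such that the running average of
    h(p_1), ..., h(p_k) is at most alpha (0 if no such k exists)."""
    hp = h(np.asarray(pvals_ordered, dtype=float))
    fdr_hat = np.cumsum(hp) / np.arange(1, hp.size + 1)   # estimated FDR at each k
    ok = np.nonzero(fdr_hat <= alpha)[0]
    return 0 if ok.size == 0 else int(ok[-1]) + 1

forward_stop = lambda p: np.log(1.0 / (1.0 - p))
print(accumulation_test([0.001, 0.002, 0.4, 0.01, 0.6, 0.9], forward_stop))  # -> 2
```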
Given an ordered list of n hypotheses, the aim is to select a data-dependent cutoff k and declare the first k hypotheses to be statistically significant while bounding the false discovery rate (FDR). Generalizing several existing methods, we develop a family of “accumulation tests” to choose a cutoff k that adapts to the amount of signal at the top of the ranked list. We introduce a new method in this family, the HingeExp method, which offers higher power to detect true signals compared to existing techniques. Our theoretical results prove that these methods control a modified FDR in finite samples, and characterize the power of the methods in the family. We apply the tests to simulated data, including a high-dimensional model selection problem for linear regression. We also compare accumulation tests to existing methods for multiple testing on a real data problem of identifying differential gene expression over a dosage gradient. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 837-849 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1180989 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1180989 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:837-849 Template-Type: ReDIF-Article 1.0 Author-Name: Hélène Juillard Author-X-Name-First: Hélène Author-X-Name-Last: Juillard Author-Name: Guillaume Chauvet Author-X-Name-First: Guillaume Author-X-Name-Last: Chauvet Author-Name: Anne Ruiz-Gazen Author-X-Name-First: Anne Author-X-Name-Last: Ruiz-Gazen Title: Estimation Under Cross-Classified Sampling With Application to a Childhood Survey Abstract: The cross-classified sampling design consists of drawing samples from a two-dimensional population, independently in each dimension. Such a design is commonly used in consumer price index surveys and has recently been applied to draw a sample of babies in the French Longitudinal Survey on Childhood, by crossing a sample of maternity units and a sample of days. We derive a general theory of estimation for this sampling design. We consider the Horvitz–Thompson estimator for a total, and show that the cross-classified design will usually result in a loss of efficiency as compared to the widespread two-stage design. We obtain the asymptotic distribution of the Horvitz–Thompson estimator and several unbiased variance estimators. To address the problem of possibly negative values, we propose simplified nonnegative variance estimators and study their bias under a super-population model. The proposed estimators are compared for totals and ratios on simulated data. An application on real data from the French Longitudinal Survey on Childhood is also presented, and we make some recommendations. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 850-858 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1186028 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1186028 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:850-858 Template-Type: ReDIF-Article 1.0 Author-Name: Guohui Wu Author-X-Name-First: Guohui Author-X-Name-Last: Wu Author-Name: Scott H.
Author-X-Name-Last: Holan Title: Bayesian Hierarchical Multi-Population Multistate Jolly–Seber Models With Covariates: Application to the Pallid Sturgeon Population Assessment Program Abstract: Estimating abundance for multiple populations is of fundamental importance to many ecological monitoring programs. Equally important is quantifying the spatial distribution and characterizing the migratory behavior of target populations within the study domain. To achieve these goals, we propose a Bayesian hierarchical multi-population multistate Jolly–Seber model that incorporates covariates. The model is proposed using a state-space framework and has several distinct advantages. First, multiple populations within the same study area can be modeled simultaneously. As a consequence, it is possible to achieve improved parameter estimation by “borrowing strength” across different populations. In many cases, such as our motivating example involving endangered species, this borrowing of strength is crucial, as there is relatively little information for one of the populations under consideration. Second, in addition to accommodating covariate information, we develop a computationally efficient Markov chain Monte Carlo algorithm that requires no tuning. Importantly, the model we propose allows us to draw inference on each population as well as on multiple populations simultaneously. Finally, we demonstrate the effectiveness of our method through a motivating example of estimating the spatial distribution and migration of hatchery and wild populations of the endangered pallid sturgeon (Scaphirhynchus albus), using data from the Pallid Sturgeon Population Assessment Program on the Lower Missouri River. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 471-483 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1211531 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1211531 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:471-483 Template-Type: ReDIF-Article 1.0 Author-Name: Giampiero Marra Author-X-Name-First: Giampiero Author-X-Name-Last: Marra Author-Name: Rosalba Radice Author-X-Name-First: Rosalba Author-X-Name-Last: Radice Author-Name: Till Bärnighausen Author-X-Name-First: Till Author-X-Name-Last: Bärnighausen Author-Name: Simon N. Wood Author-X-Name-First: Simon N. Author-X-Name-Last: Wood Author-Name: Mark E. McGovern Author-X-Name-First: Mark E. Author-X-Name-Last: McGovern Title: A Simultaneous Equation Approach to Estimating HIV Prevalence With Nonignorable Missing Responses Abstract: Estimates of HIV prevalence are important for policy to establish the health status of a country’s population and to evaluate the effectiveness of population-based interventions and campaigns. However, participation rates in testing for surveillance conducted as part of household surveys, on which many of these estimates are based, can be low. HIV positive individuals may be less likely to participate because they fear disclosure, in which case estimates obtained using conventional approaches to deal with missing data, such as imputation-based methods, will be biased. We develop a Heckman-type simultaneous equation approach that accounts for nonignorable selection, but unlike previous implementations, allows for spatial dependence and does not impose a homogeneous selection process on all respondents.
In addition, our framework addresses the issue of separation, where for instance some factors are severely unbalanced and highly predictive of the response, which would ordinarily prevent model convergence. Estimation is carried out within a penalized likelihood framework where smoothing is achieved using a parameterization of the smoothing criterion, which makes estimation more stable and efficient. We provide the software for straightforward implementation of the proposed approach, and apply our methodology to estimating national and sub-national HIV prevalence in Swaziland, Zimbabwe, and Zambia. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 484-496 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1224713 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1224713 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:484-496 Template-Type: ReDIF-Article 1.0 Author-Name: Ephraim M. Hanks Author-X-Name-First: Ephraim M. Author-X-Name-Last: Hanks Title: Modeling Spatial Covariance Using the Limiting Distribution of Spatio-Temporal Random Walks Abstract: We present an approach for modeling areal spatial covariance in observed genetic allele data by considering the stationary (limiting) distribution of a spatio-temporal Markov random walk model for gene flow. This stationary distribution corresponds to an intrinsic simultaneous autoregressive (SAR) model for spatial correlation, and provides a principled approach to specifying areal spatial models when a spatio-temporal generating process can be assumed. We apply the approach to a study of spatial genetic variation of trout in a stream network in Connecticut, USA. Journal: Journal of the American Statistical Association Pages: 497-507 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1224714 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1224714 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:497-507 Template-Type: ReDIF-Article 1.0 Author-Name: Beibei Guo Author-X-Name-First: Beibei Author-X-Name-Last: Guo Author-Name: Ying Yuan Author-X-Name-First: Ying Author-X-Name-Last: Yuan Title: Bayesian Phase I/II Biomarker-Based Dose Finding for Precision Medicine With Molecularly Targeted Agents Abstract: The optimal dose for treating patients with a molecularly targeted agent may differ according to the patient's individual characteristics, such as biomarker status. In this article, we propose a Bayesian phase I/II dose-finding design to find the optimal dose that is personalized for each patient according to his/her biomarker status. To overcome the curse of dimensionality caused by the relatively large number of biomarkers and their interactions with the dose, we employ canonical partial least squares (CPLS) to extract a small number of components from the covariate matrix containing the dose, biomarkers, and dose-by-biomarker interactions. Using these components as the covariates, we model the ordinal toxicity and efficacy using the latent-variable approach. Our model accounts for important features of molecularly targeted agents. We quantify the desirability of the dose using a utility function and propose a two-stage dose-finding algorithm to find the personalized optimal dose according to each patient's individual biomarker profile. 
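To fix ideas, the utility step of such a dose-finding design can be sketched in a few lines. The response probabilities and desirability weights below are hypothetical placeholders, not values from the article, and the article's utility is defined on ordinal toxicity and efficacy outcomes rather than the binary summaries used here.

```python
import numpy as np

# Hypothetical candidate doses and model-based outcome probabilities for one
# patient's biomarker profile (illustrative numbers only).
doses = np.array([1.0, 2.0, 3.0, 4.0])
p_eff = np.array([0.20, 0.35, 0.50, 0.55])   # P(efficacy | dose, biomarkers)
p_tox = np.array([0.05, 0.10, 0.25, 0.45])   # P(toxicity | dose, biomarkers)

# A simple utility trades efficacy against toxicity; the weights are assumed.
w_eff, w_tox = 100.0, 60.0
utility = w_eff * p_eff - w_tox * p_tox

best = doses[np.argmax(utility)]
print("utilities:", utility, "-> personalized optimal dose:", best)
```

In the actual design, these probabilities would come from the fitted latent-variable model on the CPLS components and would be updated as patients accrue.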
Simulation studies show that our proposed design has good operating characteristics, with a high probability of identifying the personalized optimal dose. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 508-520 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1228534 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1228534 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:508-520 Template-Type: ReDIF-Article 1.0 Author-Name: Justin Strait Author-X-Name-First: Justin Author-X-Name-Last: Strait Author-Name: Sebastian Kurtek Author-X-Name-First: Sebastian Author-X-Name-Last: Kurtek Author-Name: Emily Bartha Author-X-Name-First: Emily Author-X-Name-Last: Bartha Author-Name: Steven N. MacEachern Author-X-Name-First: Steven N. Author-X-Name-Last: MacEachern Title: Landmark-Constrained Elastic Shape Analysis of Planar Curves Abstract: Various approaches to statistical shape analysis exist in the current literature. They mainly differ in the representations, metrics, and/or methods for alignment of shapes. One such approach is based on landmarks, that is, mathematically or structurally meaningful points; it ignores the remaining outline information. Elastic shape analysis, a more recent approach, attempts to fix this by using a special functional representation of the parametrically defined outline to perform shape registration and subsequent statistical analyses. However, the lack of landmark identification can lead to unnatural alignment, particularly in biological and medical applications, where certain features are crucial to shape structure, comparison, and modeling. The main contribution of this work is the definition of a joint landmark-constrained elastic statistical shape analysis framework. We treat landmark points as constraints in the full shape analysis process. Thus, we inherit benefits of both methods: the landmarks help disambiguate shape alignment when the fully automatic elastic shape analysis framework produces unsatisfactory solutions. We provide standard statistical tools on the landmark-constrained shape space including mean and covariance calculation, classification, clustering, and tangent principal component analysis (PCA). We demonstrate the benefits of the proposed framework on complex shapes from the MPEG-7 dataset and two real data examples: mice T2 vertebrae and Hawaiian Drosophila fly wings. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 521-533 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1236726 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1236726 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:521-533 Template-Type: ReDIF-Article 1.0 Author-Name: Brian L. Egleston Author-X-Name-First: Brian L. Author-X-Name-Last: Egleston Author-Name: Robert G. Uzzo Author-X-Name-First: Robert G. Author-X-Name-Last: Uzzo Author-Name: Yu-Ning Wong Author-X-Name-First: Yu-Ning Author-X-Name-Last: Wong Title: Latent Class Survival Models Linked by Principal Stratification to Investigate Heterogenous Survival Subgroups Among Individuals With Early-Stage Kidney Cancer Abstract: Rates of kidney cancer have been increasing, with small incidental tumors experiencing the fastest growth rates.
Much of the increase could be due to increased use of CT scans, MRIs, and ultrasounds for unrelated conditions. Many tumors might never have been detected or become symptomatic in the past. This suggests that many patients might benefit from less aggressive therapy, such as active surveillance, whereby tumors are surgically removed only if they become sufficiently large. However, it has been difficult for clinicians to identify subgroups of patients for whom treatment might be especially beneficial or harmful. In this work, we use a principal stratification framework to estimate the proportion and characteristics of individuals who have large or small hazard rates of death in two treatment arms. This allows us to assess who might be helped or harmed by aggressive treatment. We also use Weibull mixture models. This work differs from much previous work in that the survival classes upon which principal stratification is based are latent variables. That is, survival class is not an observed variable. We apply this work using Surveillance, Epidemiology, and End Results (SEER)-Medicare claims data. Clinicians can use our methods for investigating treatments with heterogenous effects. Journal: Journal of the American Statistical Association Pages: 534-546 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1240078 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1240078 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:534-546 Template-Type: ReDIF-Article 1.0 Author-Name: José R. Zubizarreta Author-X-Name-First: José R. Author-X-Name-Last: Zubizarreta Author-Name: Luke Keele Author-X-Name-First: Luke Author-X-Name-Last: Keele Title: Optimal Multilevel Matching in Clustered Observational Studies: A Case Study of the Effectiveness of Private Schools Under a Large-Scale Voucher System Abstract: A distinctive feature of a clustered observational study is its multilevel or nested data structure arising from the assignment of treatment, in a nonrandom manner, to groups or clusters of units or individuals. Examples are ubiquitous in the health and social sciences including patients in hospitals, employees in firms, and students in schools. What is the optimal matching strategy in a clustered observational study? At first thought, one might start by matching clusters of individuals and then, within matched clusters, continue by matching individuals. But as we discuss in this article, the optimal strategy is the opposite: in typical applications, where the intracluster correlation is not one, it is best to first match individuals and, once all possible combinations of matched individuals are known, then match clusters. In this article, we use dynamic and integer programming to implement this strategy and extend optimal matching methods to hierarchical and multilevel settings. Among other matched designs, our strategy can approximate a paired clustered randomized study by finding the largest sample of matched pairs of treated and control individuals within matched pairs of treated and control clusters that is balanced according to specifications given by the investigator. This strategy directly balances covariates at both the cluster and individual levels and does not require estimating the propensity score, although the propensity score can be balanced as an additional covariate.
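The match-individuals-first idea can be illustrated with a toy assignment problem. The covariates below are simulated, and the Hungarian-algorithm pairing stands in for the article's much richer dynamic and integer programming formulation, which also balances covariates and subsequently matches clusters.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Simulated covariates: 10 treated units and 25 controls, 3 covariates each.
rng = np.random.default_rng(0)
treated = rng.normal(0.2, 1.0, size=(10, 3))
controls = rng.normal(0.0, 1.0, size=(25, 3))

# Cost matrix of pairwise Euclidean covariate distances.
cost = np.linalg.norm(treated[:, None, :] - controls[None, :, :], axis=2)

# Optimal one-to-one pairing of treated units to distinct controls.
rows, cols = linear_sum_assignment(cost)
print(list(zip(rows, cols)), "total distance:", cost[rows, cols].sum())
```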
We illustrate our results with a case study of the comparative effectiveness of public versus private voucher schools in Chile, a question of intense policy debate in the country at present. Journal: Journal of the American Statistical Association Pages: 547-560 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1240683 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1240683 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:547-560 Template-Type: ReDIF-Article 1.0 Author-Name: Dalia Chakrabarty Author-X-Name-First: Dalia Author-X-Name-Last: Chakrabarty Title: A New Bayesian Test to Test for the Intractability-Countering Hypothesis Abstract: We present a new hypothesis test in which we seek the probability of the null conditional on the data, where the null is a simplification undertaken to counter the intractability of the more complex model within which the simpler null model is nested. With the more complex model rendered intractable, the null model uses a simplifying assumption that makes it possible to learn an unknown parameter vector given the data. Bayes factors are shown to be known only up to a ratio of unknown data-dependent constants, a problem that cannot be cured using prescriptions similar to those suggested for the difficulty that noninformative priors cause in Bayes factor computation. Thus, a new test is needed in which we can circumvent Bayes factor computation. In this test, we generate data from the model in which the null hypothesis is true and assess support for the null in the measured data by comparing the marginalized posterior of the model parameter given the measured data to that given such generated data. However, such a ratio of marginalized posteriors can confound the comparison of support in one measured dataset for a null with that in another dataset for a different null. Given an application in which such comparison is undertaken, we alternatively define support in a measured dataset for a null by identifying the model parameters that are less consistent with the measured data than is minimally possible given the generated data, noting that the higher the number of such parameter values, the lower the support in the measured data for the null. Then, the probability of the null conditional on the data is given, within a Markov chain Monte Carlo (MCMC)-based scheme, by marginalizing the posterior given the measured data over parameter values that are as consistent with, or more consistent with, the measured data than with the generated data. In the aforementioned application, we test the hypothesis that a galactic state-space bears an isotropic geometry, where the (missing) data comprising measurements of some components of the state-space vector of a sample of observed galactic particles are used to learn, in a Bayesian way, the gravitational mass density of all matter in the galaxy. Without the simplifying assumption that the state-space is isotropic, the likelihood of the sought gravitational mass density given the data is intractable. For a real example galaxy, we find unequal values of the probability of the null, that the host state-space is isotropic, given two different datasets, implying that in this galaxy, the system state-space constitutes at least two disjoint sub-volumes that the two datasets, respectively, live in.
Implementation on simulated galactic data is also undertaken, as is an empirical illustration on the well-known O-ring data, to test for the form of the thermal variation of the failure probability of the O-rings. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 561-577 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1240684 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1240684 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:561-577 Template-Type: ReDIF-Article 1.0 Author-Name: Mevin B. Hooten Author-X-Name-First: Mevin B. Author-X-Name-Last: Hooten Author-Name: Devin S. Johnson Author-X-Name-First: Devin S. Author-X-Name-Last: Johnson Title: Basis Function Models for Animal Movement Abstract: Advances in satellite-based data collection techniques have served as a catalyst for new statistical methodology to analyze these data. In wildlife ecological studies, satellite-based data and methodology have provided a wealth of information about animal space use and the investigation of individual-based animal–environment relationships. With the technology for data collection improving dramatically over time, we are left with massive archives of historical animal telemetry data of varying quality. While many contemporary statistical approaches for inferring movement behavior are specified in discrete time, we develop a flexible continuous-time stochastic integral equation framework that is amenable to reduced-rank second-order covariance parameterizations. We demonstrate how the associated first-order basis functions can be constructed to mimic behavioral characteristics in realistic trajectory processes using telemetry data from mule deer and mountain lion individuals in western North America. Our approach is parallelizable and provides inference for heterogenous trajectories using nonstationary spatial modeling techniques that are feasible for large telemetry datasets. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 578-589 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1246250 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1246250 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:578-589 Template-Type: ReDIF-Article 1.0 Author-Name: Matthew Blackwell Author-X-Name-First: Matthew Author-X-Name-Last: Blackwell Title: Instrumental Variable Methods for Conditional Effects and Causal Interaction in Voter Mobilization Experiments Abstract: In democratic countries, voting is one of the most important ways for citizens to influence policy and hold their representatives accountable. And yet, in the United States and many other countries, rates of voter turnout are alarmingly low. Every election cycle, mobilization efforts encourage citizens to vote and ensure that elections reflect the true will of the people. To establish the most effective way of encouraging voter turnout, this article seeks to differentiate between (1) the synergy hypothesis that multiple instances of voter contact increase the effectiveness of a single form of contact, and (2) the diminishing returns hypothesis that multiple instances of contact are less effective or even counterproductive. 
Remarkably, previous studies have been unable to compare these hypotheses because extant approaches to analyzing experiments with noncompliance cannot speak to questions of causal interaction. I resolve this impasse by extending the traditional instrumental variables framework to accommodate multiple treatment–instrument pairs, which allows for the estimation of conditional and interaction effects to adjudicate between synergy and diminishing returns. The analysis of two voter mobilization field experiments provides the first evidence of diminishing returns to follow-up contact and a cautionary tale about experimental design for these quantities. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 590-599 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1246363 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1246363 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:590-599 Template-Type: ReDIF-Article 1.0 Author-Name: Sokbae Lee Author-X-Name-First: Sokbae Author-X-Name-Last: Lee Author-Name: Myung Hwan Seo Author-X-Name-First: Myung Hwan Author-X-Name-Last: Seo Author-Name: Youngki Shin Author-X-Name-First: Youngki Author-X-Name-Last: Shin Title: Correction Abstract: This note provides a correction to Lee, S., Seo, M. H., and Shin, Y. (2011), “Testing for Threshold Effects in Regression Models,” Journal of the American Statistical Association, 106, 220–231. Journal: Journal of the American Statistical Association Pages: 883-883 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2016.1273114 File-URL: http://hdl.handle.net/10.1080/01621459.2016.1273114 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:883-883 Template-Type: ReDIF-Article 1.0 Author-Name: David M. Blei Author-X-Name-First: David M. Author-X-Name-Last: Blei Author-Name: Alp Kucukelbir Author-X-Name-First: Alp Author-X-Name-Last: Kucukelbir Author-Name: Jon D. McAuliffe Author-X-Name-First: Jon D. Author-X-Name-Last: McAuliffe Title: Variational Inference: A Review for Statisticians Abstract: One of the core problems of modern statistics is to approximate difficult-to-compute probability densities. This problem is especially important in Bayesian statistics, which frames all inference about unknown quantities as a calculation involving the posterior density. In this article, we review variational inference (VI), a method from machine learning that approximates probability densities through optimization. VI has been used in many applications and tends to be faster than classical methods, such as Markov chain Monte Carlo sampling. The idea behind VI is to first posit a family of densities and then to find the member of that family which is closest to the target density. Closeness is measured by Kullback–Leibler divergence. We review the ideas behind mean-field variational inference, discuss the special case of VI applied to exponential family models, present a full example with a Bayesian mixture of Gaussians, and derive a variant that uses stochastic optimization to scale up to massive data. We discuss modern research in VI and highlight important open problems. VI is powerful, but it is not yet well understood. Our hope in writing this article is to catalyze statistical research on this class of algorithms.
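The coordinate ascent updates behind mean-field VI are compact enough to sketch for the review's running example of a Bayesian mixture of Gaussians. The sketch below assumes unit-variance components, uniform mixture weights, and illustrative data and hyperparameters; none of these choices come from the article itself.

```python
import numpy as np

# Simulated data from two well-separated components.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, 150), rng.normal(2, 1, 150)])
n, K, sigma0_sq = len(x), 2, 10.0   # prior: mu_k ~ N(0, sigma0_sq)

m = rng.normal(size=K)              # variational means of q(mu_k)
s_sq = np.ones(K)                   # variational variances of q(mu_k)

for _ in range(50):
    # Update responsibilities q(c_n): log phi_nk = x_n*m_k - (s_k^2 + m_k^2)/2.
    log_phi = np.outer(x, m) - 0.5 * (s_sq + m**2)
    log_phi -= log_phi.max(axis=1, keepdims=True)    # numerical stability
    phi = np.exp(log_phi)
    phi /= phi.sum(axis=1, keepdims=True)
    # Update q(mu_k) given the responsibilities (conjugate Gaussian update).
    denom = 1.0 / sigma0_sq + phi.sum(axis=0)
    m = (phi * x[:, None]).sum(axis=0) / denom
    s_sq = 1.0 / denom

print("variational component means:", np.round(m, 3))
```

Each sweep is one pass of coordinate ascent on the ELBO; the stochastic-optimization variant mentioned in the abstract replaces the full-data sums with noisy subsample estimates.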
Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 859-877 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2017.1285773 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1285773 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:859-877 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Book Reviews Journal: Journal of the American Statistical Association Pages: 878-882 Issue: 518 Volume: 112 Year: 2017 Month: 4 X-DOI: 10.1080/01621459.2017.1325629 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1325629 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:878-882 Template-Type: ReDIF-Article 1.0 Author-Name: Stéphane Guerrier Author-X-Name-First: Stéphane Author-X-Name-Last: Guerrier Author-Name: Elise Dupuis-Lozeron Author-X-Name-First: Elise Author-X-Name-Last: Dupuis-Lozeron Author-Name: Yanyuan Ma Author-X-Name-First: Yanyuan Author-X-Name-Last: Ma Author-Name: Maria-Pia Victoria-Feser Author-X-Name-First: Maria-Pia Author-X-Name-Last: Victoria-Feser Title: Simulation-Based Bias Correction Methods for Complex Models Abstract: Along with ever-increasing data size and model complexity, an important challenge frequently encountered in constructing new estimators, or in implementing a classical one such as the maximum likelihood estimator, is the computational aspect of the estimation procedure. To carry out estimation, approximate methods such as pseudo-likelihood functions or approximated estimating equations are increasingly used in practice, as these methods are typically easier to implement numerically, although they can lead to inconsistent and/or biased estimators. In this context, we extend and provide refinements on the known bias correction properties of two simulation-based methods, respectively, indirect inference and bootstrap, each with two alternatives. These results allow one to build a framework defining simulation-based estimators that can be implemented for complex models. Indeed, based on a biased or even inconsistent estimator, several simulation-based methods can be used to define new estimators that are both consistent and have reduced finite sample bias. This framework includes the classical method of indirect inference for bias correction without requiring specification of an auxiliary model. We demonstrate the equivalence between one version of indirect inference and the iterative bootstrap, both of which correct sample biases up to order n^{-3}. The iterative method can be thought of as a computationally efficient algorithm to solve the optimization problem of indirect inference. Our results provide different tools to correct the asymptotic as well as finite sample biases of estimators and give insight into which method should be applied for the problem at hand. The usefulness of the proposed approach is illustrated with the estimation of robust income distributions and generalized linear latent variable models. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 146-157 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1380031 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1380031 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:146-157 Template-Type: ReDIF-Article 1.0 Author-Name: Jingshu Wang Author-X-Name-First: Jingshu Author-X-Name-Last: Wang Author-Name: Art B. Owen Author-X-Name-First: Art B. Author-X-Name-Last: Owen Title: Admissibility in Partial Conjunction Testing Abstract: Meta-analysis combines results from multiple studies, aiming to increase power in finding their common effect. It would typically reject the null hypothesis of no effect if any one of the studies shows strong significance. The partial conjunction null hypothesis is rejected only when at least r of the n component hypotheses are nonnull, with r = 1 corresponding to a usual meta-analysis. Compared with meta-analysis, it can encourage replicable findings across studies. A by-product of applying it with different values of r is a confidence interval for r, quantifying the proportion of nonnull studies. Benjamini and Heller (2008) provided a valid test for the partial conjunction null by ignoring the r − 1 smallest p-values and applying a valid meta-analysis p-value to the remaining n − r + 1 p-values. We provide necessary and sufficient conditions for a combined p-value to be admissible for the partial conjunction hypothesis among monotone tests. Non-monotone tests always dominate monotone tests but are usually too unreasonable to be used in practice. Based on these findings, we propose a generalized form of Benjamini and Heller’s test that allows the use of various types of meta-analysis p-values, and apply our method to an example assessing the replicable benefit of new anticoagulants across subgroups of patients for stroke prevention. Journal: Journal of the American Statistical Association Pages: 158-168 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1385465 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1385465 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:158-168 Template-Type: ReDIF-Article 1.0 Author-Name: Paul Fearnhead Author-X-Name-First: Paul Author-X-Name-Last: Fearnhead Author-Name: Guillem Rigaill Author-X-Name-First: Guillem Author-X-Name-Last: Rigaill Title: Changepoint Detection in the Presence of Outliers Abstract: Many traditional methods for identifying changepoints can struggle in the presence of outliers, or when the noise is heavy-tailed. Often they will infer additional changepoints to fit the outliers. To overcome this problem, data often need to be preprocessed to remove outliers, though this is difficult for applications where the data need to be analyzed online. We present an approach to changepoint detection that is robust to the presence of outliers. The idea is to adapt existing penalized cost approaches for detecting changes so that they use loss functions that are less sensitive to outliers. We argue that loss functions that are bounded, such as the classical biweight loss, are particularly suitable, as we show that only bounded loss functions are robust to arbitrarily extreme outliers. We present an efficient dynamic programming algorithm that can find the optimal segmentation under our penalized cost criteria. Importantly, this algorithm can be used in settings where the data need to be analyzed online. We show that we can consistently estimate the number of changepoints, and accurately estimate their locations, using the biweight loss function.
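A minimal sketch of the robust penalized cost being described, assuming a bounded biweight-style loss (one common parameterization) and approximating each segment's location by its median; the article's exact dynamic program is far more efficient than this quadratic-time illustration.

```python
import numpy as np

def biweight(t, K=3.0):
    # Bounded loss: quadratic near zero, capped at K^2 for large residuals.
    return np.minimum(t**2, K**2)

def segment_cost(y, i, j, K=3.0):
    # Robust cost of segment y[i:j], with the median as an approximate
    # minimizer of the biweight location problem (an assumption of this sketch).
    seg = y[i:j]
    return biweight(seg - np.median(seg), K).sum()

def optimal_partition(y, beta=10.0):
    # Standard optimal-partitioning dynamic program with penalty beta per change.
    n = len(y)
    F = np.full(n + 1, np.inf)
    F[0] = -beta
    last = np.zeros(n + 1, dtype=int)
    for j in range(1, n + 1):
        for i in range(j):
            c = F[i] + beta + segment_cost(y, i, j)
            if c < F[j]:
                F[j], last[j] = c, i
    cps, j = [], n          # backtrack the changepoint locations
    while j > 0:
        j = last[j]
        if j > 0:
            cps.append(j)
    return sorted(cps)

rng = np.random.default_rng(2)
y = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])
y[50] += 25                 # a gross outlier that should not trigger a change
print(optimal_partition(y))
```

Because the loss is capped, the single outlier can contribute at most K^2 to any segment cost, which is cheaper than paying the penalty for two extra changepoints around it.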
We demonstrate the usefulness of our approach for applications such as analyzing well-log data, detecting copy number variation, and detecting tampering of wireless devices. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 169-183 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1385466 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1385466 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:169-183 Template-Type: ReDIF-Article 1.0 Author-Name: Yang Ni Author-X-Name-First: Yang Author-X-Name-Last: Ni Author-Name: Francesco C. Stingo Author-X-Name-First: Francesco C. Author-X-Name-Last: Stingo Author-Name: Veerabhadran Baladandayuthapani Author-X-Name-First: Veerabhadran Author-X-Name-Last: Baladandayuthapani Title: Bayesian Graphical Regression Abstract: We consider the problem of modeling conditional independence structures in heterogenous data in the presence of additional subject-level covariates, termed graphical regression. We propose a novel specification of a conditional (in)dependence function of covariates, which allows the structure of a directed graph to vary flexibly with the covariates; imposes sparsity in both edge and covariate selection; produces both subject-specific and predictive graphs; and is computationally tractable. We provide theoretical justifications of our modeling endeavor, in terms of graphical model selection consistency. We demonstrate the performance of our method through rigorous simulation studies. We illustrate our approach in a cancer genomics-based precision medicine paradigm, wherein we explore gene regulatory networks in multiple myeloma, taking prognostic clinical factors into account to obtain both population-level and subject-level gene regulatory networks. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 184-197 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1389739 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1389739 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:184-197 Template-Type: ReDIF-Article 1.0 Author-Name: Xiaojun Mao Author-X-Name-First: Xiaojun Author-X-Name-Last: Mao Author-Name: Song Xi Chen Author-X-Name-First: Song Xi Author-X-Name-Last: Chen Author-Name: Raymond K. W. Wong Author-X-Name-First: Raymond K. W. Author-X-Name-Last: Wong Title: Matrix Completion With Covariate Information Abstract: This article investigates the problem of matrix completion from corrupted data when additional covariates are available. Despite being seldom considered in the matrix completion literature, these covariates often provide valuable information for completing the unobserved entries of the high-dimensional target matrix A0. Given a covariate matrix X with its rows representing the row covariates of A0, we consider a column-space-decomposition model A0 = Xβ0 + B0, where β0 is a coefficient matrix and B0 is a low-rank matrix orthogonal to X in terms of column space. This model facilitates a clear separation between the interpretable covariate effects (Xβ0) and the flexible hidden factor effects (B0). In addition, our work allows the probabilities of observation to depend on the covariate matrix, and hence a missing-at-random mechanism is permitted.
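On fully observed, simulated data, the decomposition A0 = Xβ0 + B0 can be illustrated with a naive two-step fit: projection onto the column space of X, then a truncated SVD of the residual. This is only a sketch of the model structure, not the penalized estimator introduced next.

```python
import numpy as np

# Simulate A = X @ beta0 + B0 + noise, with B0 orthogonal to col(X).
rng = np.random.default_rng(3)
n, m, p, r = 100, 40, 3, 2
X = rng.normal(size=(n, p))
beta0 = rng.normal(size=(p, m))
U, V = rng.normal(size=(n, r)), rng.normal(size=(m, r))
B0 = (np.eye(n) - X @ np.linalg.pinv(X)) @ (U @ V.T)   # project out col(X)
A = X @ beta0 + B0 + 0.1 * rng.normal(size=(n, m))

beta_hat = np.linalg.lstsq(X, A, rcond=None)[0]        # covariate effects
R = A - X @ beta_hat                                   # residual matrix
u, s, vt = np.linalg.svd(R, full_matrices=False)
B_hat = (u[:, :r] * s[:r]) @ vt[:r]                    # rank-r hidden factors

print("relative error in B:", np.linalg.norm(B_hat - B0) / np.linalg.norm(B0))
```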
We propose a novel penalized estimator for A0 by utilizing both Frobenius-norm and nuclear-norm regularizations with an efficient and scalable algorithm. Asymptotic convergence rates of the proposed estimators are studied. The empirical performance of the proposed methodology is illustrated via both numerical experiments and a real data application. Journal: Journal of the American Statistical Association Pages: 198-210 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1389740 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1389740 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:198-210 Template-Type: ReDIF-Article 1.0 Author-Name: Xinghao Qiao Author-X-Name-First: Xinghao Author-X-Name-Last: Qiao Author-Name: Shaojun Guo Author-X-Name-First: Shaojun Author-X-Name-Last: Guo Author-Name: Gareth M. James Author-X-Name-First: Gareth M. Author-X-Name-Last: James Title: Functional Graphical Models Abstract: Graphical models have attracted increasing attention in recent years, especially in settings involving high-dimensional data. In particular, Gaussian graphical models are used to model the conditional dependence structure among multiple Gaussian random variables. As a result of its computational efficiency, the graphical lasso (glasso) has become one of the most popular approaches for fitting high-dimensional graphical models. In this paper, we extend the graphical models concept to model the conditional dependence structure among p random functions. In this setting, not only is p large, but each function is itself a high-dimensional object, posing an additional level of statistical and computational complexity. We develop an extension of the glasso criterion (fglasso), which estimates the functional graphical model by imposing a block sparsity constraint on the precision matrix, via a group lasso penalty. The fglasso criterion can be optimized using an efficient block coordinate descent algorithm. We establish the concentration inequalities of the estimates, which guarantee the desirable graph support recovery property, that is, with probability tending to one, the fglasso will correctly identify the true conditional dependence structure. Finally, we show that the fglasso significantly outperforms possible competing methods through both simulations and an analysis of a real-world electroencephalography dataset comparing alcoholic and nonalcoholic patients. Journal: Journal of the American Statistical Association Pages: 211-222 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1390466 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1390466 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:211-222 Template-Type: ReDIF-Article 1.0 Author-Name: Mauricio Sadinle Author-X-Name-First: Mauricio Author-X-Name-Last: Sadinle Author-Name: Jing Lei Author-X-Name-First: Jing Author-X-Name-Last: Lei Author-Name: Larry Wasserman Author-X-Name-First: Larry Author-X-Name-Last: Wasserman Title: Least Ambiguous Set-Valued Classifiers With Bounded Error Levels Abstract: In most classification tasks, there are observations that are ambiguous and therefore difficult to correctly label. Set-valued classifiers output sets of plausible labels rather than a single label, thereby giving a more appropriate and informative treatment to the labeling of ambiguous instances. 
We introduce a framework for multiclass set-valued classification, where the classifiers guarantee user-defined levels of coverage or confidence (the probability that the true label is contained in the set) while minimizing the ambiguity (the expected size of the output). We first derive oracle classifiers assuming the true distribution to be known. We show that the oracle classifiers are obtained from level sets of the functions that define the conditional probability of each class. Then we develop estimators with good asymptotic and finite sample properties. The proposed estimators build on existing single-label classifiers. The optimal classifier can sometimes output the empty set, but we provide two solutions to fix this issue that are suitable for various practical needs. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 223-234 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1395341 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1395341 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:223-234 Template-Type: ReDIF-Article 1.0 Author-Name: Amy Willis Author-X-Name-First: Amy Author-X-Name-Last: Willis Title: Confidence Sets for Phylogenetic Trees Abstract: Inferring evolutionary histories (phylogenetic trees) has important applications in biology, criminology, and public health. However, phylogenetic trees are complex mathematical objects that reside in a non-Euclidean space, which complicates their analysis. While our mathematical, algorithmic, and probabilistic understanding of phylogenies in their metric space is mature, rigorous inferential infrastructure is as yet undeveloped. In this manuscript, we unify recent computational and probabilistic advances to construct tree-valued confidence sets. The procedure accounts for both center and multiple directions of tree-valued variability. We draw on block replicates to improve testing, identifying the best-supported most recent ancestor of the Zika virus, and formally testing the hypothesis that a Floridian dentist with AIDS infected two of his patients with HIV. The method illustrates connections between variability in Euclidean and tree space, opening phylogenetic tree analysis to techniques available in the multivariate Euclidean setting. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 235-244 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1395342 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1395342 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:235-244 Template-Type: ReDIF-Article 1.0 Author-Name: Li Ma Author-X-Name-First: Li Author-X-Name-Last: Ma Author-Name: Jialiang Mao Author-X-Name-First: Jialiang Author-X-Name-Last: Mao Title: Fisher Exact Scanning for Dependency Abstract: We introduce a method, called Fisher exact scanning (FES), for testing and identifying variable dependency that generalizes Fisher’s exact test on 2 × 2 contingency tables to R × C contingency tables and continuous sample spaces. FES proceeds by scanning over the sample space using windows in the form of 2 × 2 tables of various sizes, and performing Fisher’s exact test on each window.
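A stripped-down version of the scanning step, assuming a small fixed set of cut-points and a plain Bonferroni correction on simulated data; the actual FES uses a coarse-to-fine window family whose tests it shows to be nearly mutually independent.

```python
import numpy as np
from scipy.stats import fisher_exact

# Simulated dependent pair of continuous variables on [0, 1).
rng = np.random.default_rng(4)
x = rng.uniform(size=500)
y = (x + rng.normal(scale=0.3, size=500)) % 1.0

cuts = [0.25, 0.5, 0.75]        # assumed cut-points defining the windows
pvals = []
for cx in cuts:
    for cy in cuts:
        # 2x2 table of counts induced by the window (cx, cy).
        table = [[np.sum((x <= cx) & (y <= cy)), np.sum((x <= cx) & (y > cy))],
                 [np.sum((x > cx) & (y <= cy)), np.sum((x > cx) & (y > cy))]]
        pvals.append(fisher_exact(table)[1])

# Bonferroni-corrected evidence of dependency (values above 1 mean no evidence).
print("smallest corrected p-value:", min(pvals) * len(pvals))
```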
Based on a factorization of Fisher’s multivariate hypergeometric (MHG) likelihood into the product of the univariate hypergeometric likelihoods, we show that there exists a coarse-to-fine, sequential generative representation for the MHG model in the form of a Bayesian network, which in turn implies the mutual independence (up to deviation due to discreteness) among the Fisher’s exact tests performed under FES. This allows an exact characterization of the joint null distribution of the p-values and gives rise to an effective inference recipe through simple multiple testing procedures such as Šidák and Bonferroni corrections, eliminating the need for resampling. In addition, FES can characterize dependency through reporting significant windows after multiple testing control. The computational complexity of FES is approximately linear in the sample size, which along with the avoidance of resampling makes it ideal for analyzing massive datasets. We use extensive numerical studies to illustrate the work of FES and compare it to several state-of-the-art methods for testing dependency in both statistical and computational performance. Finally, we apply FES to the analysis of a microbiome dataset and further investigate its relationship with other popular dependency metrics in that context. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 245-258 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1397522 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1397522 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:245-258 Template-Type: ReDIF-Article 1.0 Author-Name: Anna Bellach Author-X-Name-First: Anna Author-X-Name-Last: Bellach Author-Name: Michael R. Kosorok Author-X-Name-First: Michael R. Author-X-Name-Last: Kosorok Author-Name: Ludger Rüschendorf Author-X-Name-First: Ludger Author-X-Name-Last: Rüschendorf Author-Name: Jason P. Fine Author-X-Name-First: Jason P. Author-X-Name-Last: Fine Title: Weighted NPMLE for the Subdistribution of a Competing Risk Abstract: Direct regression modeling of the subdistribution has become popular for analyzing data with multiple, competing event types. All general approaches so far rely on non-likelihood-based procedures and target covariate effects on the subdistribution. We introduce a novel weighted likelihood function that allows for a direct extension of the Fine–Gray model to a broad class of semiparametric regression models. The model accommodates time-dependent covariate effects on the subdistribution hazard. To motivate the proposed likelihood method, we derive standard nonparametric estimators and discuss a new interpretation based on pseudo risk sets. We establish consistency and asymptotic normality of the estimators and propose a sandwich estimator of the variance. In comprehensive simulation studies, we demonstrate the solid performance of the weighted nonparametric maximum likelihood estimation in the presence of independent right censoring. We provide an application to a very large bone marrow transplant dataset, thereby illustrating its practical utility. Supplementary materials for this article are available online.
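For orientation, the nonparametric cumulative incidence function that underlies such subdistribution quantities can be computed directly. The sketch below assumes continuous (untied) event times on simulated data and is not the article's weighted NPMLE.

```python
import numpy as np

def cuminc(times, events, cause=1):
    # events: 0 = censored, 1, 2, ... = competing event types.
    order = np.argsort(times)
    times, events = times[order], events[order]
    n = len(times)
    at_risk = n - np.arange(n)                 # risk-set size at each time
    surv = np.cumprod(1.0 - (events > 0) / at_risk)   # overall KM survival
    surv_left = np.concatenate([[1.0], surv[:-1]])    # S(t-)
    jumps = surv_left * (events == cause) / at_risk   # Aalen-Johansen jumps
    return times, np.cumsum(jumps)

rng = np.random.default_rng(5)
t = rng.exponential(1.0, 200)
e = rng.choice([0, 1, 2], size=200, p=[0.2, 0.5, 0.3])
times, cif = cuminc(t, e, cause=1)
print("CIF for cause 1 at the last event time:", round(cif[-1], 3))
```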
Journal: Journal of the American Statistical Association Pages: 259-270 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1401540 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1401540 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:259-270 Template-Type: ReDIF-Article 1.0 Author-Name: Yang Li Author-X-Name-First: Yang Author-X-Name-Last: Li Author-Name: Jun S. Liu Author-X-Name-First: Jun S. Author-X-Name-Last: Liu Title: Robust Variable and Interaction Selection for Logistic Regression and General Index Models Abstract: Under the logistic regression framework, we propose a forward-backward method, SODA, for variable selection with both main and quadratic interaction terms. In the forward stage, SODA adds in predictors that have significant overall effects, whereas in the backward stage SODA removes unimportant terms to optimize the extended Bayesian information criterion (EBIC). Compared with existing methods for variable selection in quadratic discriminant analysis, SODA can deal with high-dimensional data in which the number of predictors is much larger than the sample size and does not require the joint normality assumption on predictors, leading to much enhanced robustness. We further extend SODA to conduct variable selection and model fitting for general index models. Compared with existing variable selection methods based on sliced inverse regression (SIR), SODA requires neither the linearity nor the constant variance condition and is thus more robust. Our theoretical analysis establishes the variable-selection consistency of SODA under high-dimensional settings, and our simulation studies as well as real-data applications demonstrate the superior performance of SODA in dealing with non-Gaussian design matrices in both logistic and general index models. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 271-286 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1401541 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1401541 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:271-286 Template-Type: ReDIF-Article 1.0 Author-Name: Yacine Aït-Sahalia Author-X-Name-First: Yacine Author-X-Name-Last: Aït-Sahalia Author-Name: Dacheng Xiu Author-X-Name-First: Dacheng Author-X-Name-Last: Xiu Title: Principal Component Analysis of High-Frequency Data Abstract: We develop the necessary methodology to conduct principal component analysis at high frequency. We construct estimators of realized eigenvalues, eigenvectors, and principal components, and provide the asymptotic distribution of these estimators. Empirically, we study the high-frequency covariance structure of the constituents of the S&P 100 Index using as little as one week of high-frequency data at a time, and examine whether it is compatible with the evidence accumulated over decades of lower-frequency returns. We find a surprising consistency between the low- and high-frequency structures. During the recent financial crisis, the first principal component becomes increasingly dominant, explaining up to 60% of the variation on its own, while the second principal component drives the common variation of financial sector stocks. Supplementary materials for this article are available online.
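The basic object of the analysis, a realized covariance matrix and its spectrum, is easy to sketch on simulated intraday returns; the article's contribution is the estimators and asymptotic theory that make inference on these realized quantities valid. The factor structure and sizes below are illustrative assumptions.

```python
import numpy as np

# Simulated one-factor intraday returns: T observations on d assets.
rng = np.random.default_rng(6)
T, d = 390, 20                                  # e.g., one-minute returns
common = rng.normal(size=(T, 1))
R = 0.8 * common + 0.6 * rng.normal(size=(T, d))

# Realized covariance: sum of outer products of the return vectors.
rcov = R.T @ R

# Realized eigenvalues, in decreasing order, and the first PC's share.
eigvals = np.linalg.eigvalsh(rcov)[::-1]
print("share of variation, first PC:", round(eigvals[0] / eigvals.sum(), 3))
```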
Journal: Journal of the American Statistical Association Pages: 287-303 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1401542 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1401542 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:287-303 Template-Type: ReDIF-Article 1.0 Author-Name: Peng Ding Author-X-Name-First: Peng Author-X-Name-Last: Ding Author-Name: Avi Feller Author-X-Name-First: Avi Author-X-Name-Last: Feller Author-Name: Luke Miratrix Author-X-Name-First: Luke Author-X-Name-Last: Miratrix Title: Decomposing Treatment Effect Variation Abstract: Understanding and characterizing treatment effect variation in randomized experiments has become essential for going beyond the “black box” of the average treatment effect. Nonetheless, traditional statistical approaches often ignore or assume away such variation. In the context of randomized experiments, this article proposes a framework for decomposing overall treatment effect variation into a systematic component explained by observed covariates and a remaining idiosyncratic component. Our framework is fully randomization-based, with estimates of treatment effect variation that are entirely justified by the randomization itself. Our framework can also account for noncompliance, which is an important practical complication. We make several contributions. First, we show that randomization-based estimates of systematic variation are very similar in form to estimates from fully interacted linear regression and two-stage least squares. Second, we use these estimators to develop an omnibus test for systematic treatment effect variation, both with and without noncompliance. Third, we propose an R^2-like measure of treatment effect variation explained by covariates and, when applicable, noncompliance. Finally, we assess these methods via simulation studies and apply them to the Head Start Impact Study, a large-scale randomized experiment. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 304-317 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1407322 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1407322 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:304-317 Template-Type: ReDIF-Article 1.0 Author-Name: Rahul Mazumder Author-X-Name-First: Rahul Author-X-Name-Last: Mazumder Author-Name: Arkopal Choudhury Author-X-Name-First: Arkopal Author-X-Name-Last: Choudhury Author-Name: Garud Iyengar Author-X-Name-First: Garud Author-X-Name-Last: Iyengar Author-Name: Bodhisattva Sen Author-X-Name-First: Bodhisattva Author-X-Name-Last: Sen Title: A Computational Framework for Multivariate Convex Regression and Its Variants Abstract: We study the nonparametric least squares estimator (LSE) of a multivariate convex regression function. The LSE, given as the solution to a quadratic program with O(n^2) linear constraints (n being the sample size), is difficult to compute for large problems. Exploiting problem specific structure, we propose a scalable algorithmic framework based on the augmented Lagrangian method to compute the LSE. We develop a novel approach to obtain smooth convex approximations to the fitted (piecewise affine) convex LSE and provide formal bounds on the quality of approximation.
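A standard way to pose this quadratic program is to constrain fitted values and subgradients by the first-order convexity inequalities. The toy version below, on simulated data and with a generic solver's defaults, is offered only to make the O(n^2)-constraint structure concrete; the article's augmented Lagrangian framework is what makes much larger n practical.

```python
import cvxpy as cp
import numpy as np

# Simulated convex truth plus noise.
rng = np.random.default_rng(7)
n, d = 60, 2
X = rng.uniform(-1, 1, size=(n, d))
y = (X**2).sum(axis=1) + 0.1 * rng.normal(size=n)

theta = cp.Variable(n)        # fitted values at the design points
xi = cp.Variable((n, d))      # subgradients at the design points

# Convexity: theta_j >= theta_i + xi_i . (x_j - x_i) for all pairs (i, j).
constraints = [theta[j] >= theta[i] + xi[i] @ (X[j] - X[i])
               for i in range(n) for j in range(n) if i != j]

prob = cp.Problem(cp.Minimize(cp.sum_squares(y - theta)), constraints)
prob.solve()
print("training RSS of the convex LSE:", round(prob.value, 4))
```

Even at n = 60 there are already n(n-1) = 3540 constraints, which is why off-the-shelf interior point solvers stall at the sample sizes the article targets.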
When the number of samples is not too large compared to the dimension of the predictor, we propose a regularization scheme, Lipschitz convex regression, where we constrain the norm of the subgradients, and study the rates of convergence of the obtained LSE. Our algorithmic framework is simple and flexible and can be easily adapted to handle variants: estimation of a nondecreasing/nonincreasing convex/concave (with or without a Lipschitz bound) function. We perform numerical studies illustrating the scalability of the proposed algorithm; on some instances our proposal leads to more than a 10,000-fold improvement in runtime when compared to off-the-shelf interior point solvers for problems with n = 500. Journal: Journal of the American Statistical Association Pages: 318-331 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1407771 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1407771 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:318-331 Template-Type: ReDIF-Article 1.0 Author-Name: Benjamin B. Risk Author-X-Name-First: Benjamin B. Author-X-Name-Last: Risk Author-Name: David S. Matteson Author-X-Name-First: David S. Author-X-Name-Last: Matteson Author-Name: David Ruppert Author-X-Name-First: David Author-X-Name-Last: Ruppert Title: Linear Non-Gaussian Component Analysis Via Maximum Likelihood Abstract: Independent component analysis (ICA) is popular in many applications, including cognitive neuroscience and signal processing. Due to computational constraints, principal component analysis (PCA) is used for dimension reduction prior to ICA (PCA+ICA), which could remove important information. The problem is that interesting independent components (ICs) could be mixed into several principal components that are discarded, and these ICs then cannot be recovered. We formulate a linear non-Gaussian component model with Gaussian noise components. To estimate the model parameters, we propose likelihood component analysis (LCA), in which dimension reduction and latent variable estimation are achieved simultaneously. Our method orders components by their marginal likelihood rather than ordering components by variance as in PCA. We present a parametric LCA using the logistic density and a semiparametric LCA using tilted Gaussians with cubic B-splines. Our algorithm is scalable to datasets common in applications (e.g., hundreds of thousands of observations across hundreds of variables with dozens of latent components). In simulations, latent components are recovered that are discarded by PCA+ICA methods. We apply our method to multivariate data and demonstrate that LCA is a useful data visualization and dimension reduction tool that reveals features not apparent from PCA or PCA+ICA. We also apply our method to a functional magnetic resonance imaging experiment from the Human Connectome Project and identify artifacts missed by PCA+ICA. We present theoretical results on identifiability of the linear non-Gaussian component model and consistency of LCA. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 332-343 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1407772 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1407772 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:332-343 Template-Type: ReDIF-Article 1.0 Author-Name: S. Luo Author-X-Name-First: S. Author-X-Name-Last: Luo Author-Name: R. Song Author-X-Name-First: R. Author-X-Name-Last: Song Author-Name: M. Styner Author-X-Name-First: M. Author-X-Name-Last: Styner Author-Name: J. H. Gilmore Author-X-Name-First: J. H. Author-X-Name-Last: Gilmore Author-Name: H. Zhu Author-X-Name-First: H. Author-X-Name-Last: Zhu Title: FSEM: Functional Structural Equation Models for Twin Functional Data Abstract: The aim of this article is to develop a novel class of functional structural equation models (FSEMs) for dissecting functional genetic and environmental effects on twin functional data, while characterizing the varying association between functional data and covariates of interest. We propose a three-stage estimation procedure to estimate varying coefficient functions for various covariates (e.g., gender) as well as three covariance operators for the genetic and environmental effects. We develop an inference procedure based on weighted likelihood ratio statistics to test the genetic/environmental effect at a fixed location or over a compact region. We also systematically carry out the theoretical analysis of the estimated varying functions, the weighted likelihood ratio statistics, and the estimated covariance operators. We conduct extensive Monte Carlo simulations to examine the finite-sample performance of the estimation and inference procedures. We apply the proposed FSEM to quantify the degree of genetic and environmental effects on twin white matter tracts obtained from the UNC early brain development study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 344-357 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1407773 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1407773 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:344-357 Template-Type: ReDIF-Article 1.0 Author-Name: Zijian Guo Author-X-Name-First: Zijian Author-X-Name-Last: Guo Author-Name: Wanjie Wang Author-X-Name-First: Wanjie Author-X-Name-Last: Wang Author-Name: T. Tony Cai Author-X-Name-First: T. Tony Author-X-Name-Last: Cai Author-Name: Hongzhe Li Author-X-Name-First: Hongzhe Author-X-Name-Last: Li Title: Optimal Estimation of Genetic Relatedness in High-Dimensional Linear Models Abstract: Estimating the genetic relatedness between two traits based on the genome-wide association data is an important problem in genetics research. In the framework of high-dimensional linear models, we introduce two measures of genetic relatedness and develop optimal estimators for them. One is the genetic covariance, defined as the inner product of the two regression vectors; the other is the genetic correlation, which is the inner product normalized by the lengths of the two vectors. We propose functional de-biased estimators (FDEs), which consist of an initial estimation step with the plug-in scaled Lasso estimator, and a further bias correction step. We also develop estimators of the quadratic functionals of the regression vectors, which can be used to estimate the heritability of each trait. The estimators are shown to be minimax rate-optimal and can be efficiently implemented. Simulation results show that FDEs provide better estimates of the genetic relatedness than simple plug-in estimates.
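The plug-in estimator referred to here is simple to state in code. The sketch below uses an ordinary Lasso on simulated traits and deliberately omits the scaled-Lasso initialization and the bias-correction step that define the FDEs; the tuning value is an assumption.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Simulated genotypes and two traits with overlapping sparse supports.
rng = np.random.default_rng(8)
n, p = 200, 500
X = rng.normal(size=(n, p))
b1 = np.zeros(p); b1[:10] = 0.5
b2 = np.zeros(p); b2[5:15] = 0.5
y1 = X @ b1 + rng.normal(size=n)
y2 = X @ b2 + rng.normal(size=n)

# Naive plug-in: inner products of sparse regression estimates.
bh1 = Lasso(alpha=0.1).fit(X, y1).coef_
bh2 = Lasso(alpha=0.1).fit(X, y2).coef_
gcov = bh1 @ bh2
gcor = gcov / (np.linalg.norm(bh1) * np.linalg.norm(bh2) + 1e-12)
print("plug-in genetic covariance:", round(gcov, 3), "correlation:", round(gcor, 3))
```

The shrinkage that makes the Lasso consistent also biases these inner products, which is exactly the defect the article's de-biasing step targets.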
FDE is also applied to an analysis of a yeast segregant dataset with multiple traits to estimate the genetic relatedness among these traits. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 358-369 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1407774 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1407774 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:358-369 Template-Type: ReDIF-Article 1.0 Author-Name: Jon Arni Steingrimsson Author-X-Name-First: Jon Arni Author-X-Name-Last: Steingrimsson Author-Name: Liqun Diao Author-X-Name-First: Liqun Author-X-Name-Last: Diao Author-Name: Robert L. Strawderman Author-X-Name-First: Robert L. Author-X-Name-Last: Strawderman Title: Censoring Unbiased Regression Trees and Ensembles Abstract: This article proposes a novel paradigm for building regression trees and ensemble learning in survival analysis. Generalizations of the classification and regression trees (CART) and random forests (RF) algorithms for general loss functions, and in the latter case more general bootstrap procedures, are both introduced. These results, in combination with an extension of the theory of censoring unbiased transformations (CUTs) applicable to loss functions, underpin the development of two new classes of algorithms for constructing survival trees and survival forests: censoring unbiased regression trees and censoring unbiased regression ensembles. For a certain “doubly robust” CUT of squared error loss, we further show how these new algorithms can be implemented using existing software (e.g., CART, RF). Comparisons of these methods to existing ensemble procedures for predicting survival probabilities are provided in both simulated settings and through applications to four datasets. It is shown that these new methods either improve upon, or remain competitive with, existing implementations of random survival forests, conditional inference forests, and recursively imputed survival trees. Journal: Journal of the American Statistical Association Pages: 370-383 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1407775 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1407775 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:370-383 Template-Type: ReDIF-Article 1.0 Author-Name: Yaowu Liu Author-X-Name-First: Yaowu Author-X-Name-Last: Liu Author-Name: Jun Xie Author-X-Name-First: Jun Author-X-Name-Last: Xie Title: Accurate and Efficient P-value Calculation Via Gaussian Approximation: A Novel Monte-Carlo Method Abstract: It is of fundamental interest in statistics to test the significance of a set of covariates. For example, in genome-wide association studies, a joint null hypothesis of no genetic effect is tested for a set of multiple genetic variants. The minimum p-value method, higher criticism, and Berk–Jones tests are particularly effective when the covariates with nonzero effects are sparse. However, the correlations among covariates and the non-Gaussian distribution of the response pose a great challenge for the p-value calculation of the three tests. In practice, permutation is commonly used to obtain accurate p-values, but it is computationally very intensive, especially when we need to conduct a large amount of hypothesis testing.
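The permutation benchmark being described can be sketched as follows, using a hypothetical maximum-standardized-score version of the minimum p-value statistic on simulated genotypes; the loop over permutations is exactly the cost that a faster approximation would need to avoid.

```python
import numpy as np

# Simulated genotype matrix and a null (unassociated) response.
rng = np.random.default_rng(9)
n, m = 300, 50
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)
y = rng.normal(size=n)

def min_p_stat(y, G):
    # Standardized marginal association scores; the maximum |score| is
    # equivalent to taking the minimum marginal p-value.
    z = (G - G.mean(0)).T @ (y - y.mean()) / (y.std() * np.sqrt(n))
    z /= G.std(0) + 1e-12
    return np.abs(z).max()

obs = min_p_stat(y, G)
perm = np.array([min_p_stat(rng.permutation(y), G) for _ in range(2000)])
print("permutation p-value:", (1 + (perm >= obs).sum()) / (1 + len(perm)))
```

With 2000 permutations this already does 2000 full passes over the genotype matrix, and a genome-wide analysis repeats the whole exercise for every variant set tested.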
In this paper, we propose a Gaussian approximation method based on a Monte Carlo scheme, which is computationally more efficient than permutation while still achieving similar accuracy. We derive nonasymptotic approximation error bounds that could vanish in the limit even if the number of covariates is much larger than the sample size. Through real-genotype-based simulations and data analysis of a genome-wide association study of Crohn’s disease, we compare the accuracy and computation cost of our proposed method, of permutation, and of the method based on the asymptotic distribution. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 384-392 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1407776 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1407776 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:384-392 Template-Type: ReDIF-Article 1.0 Author-Name: Cody Alsaker Author-X-Name-First: Cody Author-X-Name-Last: Alsaker Author-Name: F. Jay Breidt Author-X-Name-First: F. Jay Author-X-Name-Last: Breidt Author-Name: Mark J. van der Woerd Author-X-Name-First: Mark J. Author-X-Name-Last: van der Woerd Title: Minimum Mean Squared Error Estimation of the Radius of Gyration in Small-Angle X-Ray Scattering Experiments Abstract: Small-angle X-ray scattering (SAXS) is a technique that yields low-resolution structural information of biological macromolecules by exposing a large ensemble of molecules in solution to a powerful X-ray beam. The beam interacts with the molecules and the intensity of the scattered beam is recorded on a detector plate. The radius of gyration for a molecule, which is a measure of the spread of its mass, can be estimated from the lowest scattering angles of SAXS data. This estimation method requires specification of a window of scattering angles. Under a local polynomial model with autoregressive errors, we develop methodology and supporting asymptotic theory for selection of an optimal window, minimum mean squared error estimation of the radius of gyration, and estimation of its variance. Simulation studies confirm the quality of our asymptotic approximations and the superior performance of the proposed methodology relative to the accepted standard. Our semi-automated methodology makes it feasible to estimate the radius of gyration many times, from replicated SAXS data under various experimental conditions, in an objective and reproducible manner. This in turn allows for secondary analyses of the dataset of estimates, as we demonstrate with a split–split plot analysis for 357 SAXS intensity curves. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 39-47 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1408467 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1408467 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:39-47 Template-Type: ReDIF-Article 1.0 Author-Name: HaiYing Wang Author-X-Name-First: HaiYing Author-X-Name-Last: Wang Author-Name: Min Yang Author-X-Name-First: Min Author-X-Name-Last: Yang Author-Name: John Stufken Author-X-Name-First: John Author-X-Name-Last: Stufken Title: Information-Based Optimal Subdata Selection for Big Data Linear Regression Abstract: Extraordinary amounts of data are being produced in many branches of science. Proven statistical methods are no longer applicable with extraordinarily large datasets due to computational limitations. A critical step in big data analysis is data reduction. Existing investigations in the context of linear regression focus on subsampling-based methods. However, not only is this approach prone to sampling errors, it also leads to a covariance matrix of the estimators that is typically bounded from below by a term that is of the order of the inverse of the subdata size. We propose a novel approach, termed information-based optimal subdata selection (IBOSS). Compared to leading existing subdata methods, the IBOSS approach has the following advantages: (i) it is significantly faster; (ii) it is suitable for distributed parallel computing; (iii) the variances of the slope parameter estimators converge to 0 as the full data size increases even if the subdata size is fixed, that is, the convergence rate depends on the full data size; (iv) data analysis for IBOSS subdata is straightforward and the sampling distribution of an IBOSS estimator is easy to assess. Theoretical results and extensive simulations demonstrate that the IBOSS approach is superior to subsampling-based methods, sometimes by orders of magnitude. The advantages of the new approach are also illustrated through analysis of real data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 393-405 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1408468 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1408468 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:393-405 Template-Type: ReDIF-Article 1.0 Author-Name: Raymond K. W. Wong Author-X-Name-First: Raymond K. W. Author-X-Name-Last: Wong Author-Name: Yehua Li Author-X-Name-First: Yehua Author-X-Name-Last: Li Author-Name: Zhengyuan Zhu Author-X-Name-First: Zhengyuan Author-X-Name-Last: Zhu Title: Partially Linear Functional Additive Models for Multivariate Functional Data Abstract: We investigate a class of partially linear functional additive models (PLFAM) that predicts a scalar response by both parametric effects of a multivariate predictor and nonparametric effects of a multivariate functional predictor. We jointly model multiple functional predictors that are cross-correlated using multivariate functional principal component analysis (mFPCA), and model the nonparametric effects of the principal component scores as additive components in the PLFAM. To address the high-dimensional nature of functional data, we let the number of mFPCA components diverge to infinity with the sample size, and adopt the component selection and smoothing operator (COSSO) penalty to select relevant components and regularize the fitting. 
A fundamental difference between our framework and the existing high-dimensional additive models is that the mFPCA scores are estimated with error, and the magnitude of measurement error increases with the order of mFPCA. We establish the asymptotic convergence rate for our estimator, while allowing the number of components to diverge. When the number of additive components is fixed, we also establish the asymptotic distribution for the partially linear coefficients. The practical performance of the proposed methods is illustrated via simulation studies and a crop yield prediction application. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 406-418 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1411268 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1411268 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:406-418 Template-Type: ReDIF-Article 1.0 Author-Name: Damian Brzyski Author-X-Name-First: Damian Author-X-Name-Last: Brzyski Author-Name: Alexej Gossmann Author-X-Name-First: Alexej Author-X-Name-Last: Gossmann Author-Name: Weijie Su Author-X-Name-First: Weijie Author-X-Name-Last: Su Author-Name: Małgorzata Bogdan Author-X-Name-First: Małgorzata Author-X-Name-Last: Bogdan Title: Group SLOPE – Adaptive Selection of Groups of Predictors Abstract: Sorted L-One Penalized Estimation (SLOPE; Bogdan et al. 2013, 2015) is a relatively new convex optimization procedure, which allows for adaptive selection of regressors under sparse high-dimensional designs. Here, we extend the idea of SLOPE to deal with the situation when one aims at selecting whole groups of explanatory variables instead of single regressors. Such groups can be formed by clustering strongly correlated predictors or groups of dummy variables corresponding to different levels of the same qualitative predictor. We formulate the respective convex optimization problem, group SLOPE (gSLOPE), and propose an efficient algorithm for its solution. We also define a notion of the group false discovery rate (gFDR) and provide a choice of the sequence of tuning parameters for gSLOPE so that gFDR is provably controlled at a prespecified level if the groups of variables are orthogonal to each other. Moreover, we prove that the resulting procedure adapts to unknown sparsity and is asymptotically minimax with respect to the estimation of the proportions of variance of the response variable explained by regressors from different groups. We also provide a method for the choice of the regularizing sequence when variables in different groups are not orthogonal but statistically independent and illustrate its good properties with computer simulations. Finally, we illustrate the advantages of gSLOPE in the context of Genome Wide Association Studies. R package grpSLOPE with an implementation of our method is available on The Comprehensive R Archive Network. Journal: Journal of the American Statistical Association Pages: 419-433 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1411269 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1411269 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:419-433 Template-Type: ReDIF-Article 1.0 Author-Name: Raphaël Huser Author-X-Name-First: Raphaël Author-X-Name-Last: Huser Author-Name: Jennifer L. 
Wadsworth Author-X-Name-First: Jennifer L. Author-X-Name-Last: Wadsworth Title: Modeling Spatial Processes with Unknown Extremal Dependence Class Abstract: Many environmental processes exhibit weakening spatial dependence as events become more extreme. Well-known limiting models, such as max-stable or generalized Pareto processes, cannot capture this, which can lead to a preference for models that exhibit a property known as asymptotic independence. However, weakening dependence does not automatically imply asymptotic independence, and whether the process is truly asymptotically (in)dependent is usually far from clear. The distinction is key as it can have a large impact upon extrapolation, that is, the estimated probabilities of events more extreme than those observed. In this work, we present a single spatial model that is able to capture both dependence classes in a parsimonious manner, and with a smooth transition between the two cases. The model covers a wide range of possibilities from asymptotic independence through to complete dependence, and permits weakening dependence of extremes even under asymptotic dependence. Censored likelihood-based inference for the implied copula is feasible in moderate dimensions due to closed-form margins. The model is applied to oceanographic datasets with ambiguous true limiting dependence structure. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 434-444 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1411813 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1411813 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:434-444 Template-Type: ReDIF-Article 1.0 Author-Name: Geir-Arne Fuglstad Author-X-Name-First: Geir-Arne Author-X-Name-Last: Fuglstad Author-Name: Daniel Simpson Author-X-Name-First: Daniel Author-X-Name-Last: Simpson Author-Name: Finn Lindgren Author-X-Name-First: Finn Author-X-Name-Last: Lindgren Author-Name: Håvard Rue Author-X-Name-First: Håvard Author-X-Name-Last: Rue Title: Constructing Priors that Penalize the Complexity of Gaussian Random Fields Abstract: Priors are important for achieving proper posteriors with physically meaningful covariance structures for Gaussian random fields (GRFs) since the likelihood typically only provides limited information about the covariance structure under in-fill asymptotics. We extend the recent penalized complexity prior framework and develop a principled joint prior for the range and the marginal variance of one-dimensional, two-dimensional, and three-dimensional Matérn GRFs with fixed smoothness. The prior is weakly informative and penalizes complexity by shrinking the range toward infinity and the marginal variance toward zero. We propose guidelines for selecting the hyperparameters, and a simulation study shows that the new prior provides a principled alternative to reference priors that can leverage prior knowledge to achieve shorter credible intervals while maintaining good coverage. We extend the prior to a nonstationary GRF parameterized through local ranges and marginal standard deviations, and introduce a scheme for selecting the hyperparameters based on the coverage of the parameters when fitting simulated stationary data. 
The approach is applied to a dataset of annual precipitation in southern Norway and the scheme for selecting the hyperparameters leads to conservative estimates of nonstationarity and improved predictive performance over the stationary model. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 445-452 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1415907 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1415907 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:445-452 Template-Type: ReDIF-Article 1.0 Author-Name: Zeda Li Author-X-Name-First: Zeda Author-X-Name-Last: Li Author-Name: Robert T. Krafty Author-X-Name-First: Robert T. Author-X-Name-Last: Krafty Title: Adaptive Bayesian Time–Frequency Analysis of Multivariate Time Series Abstract: This article introduces a nonparametric approach to multivariate time-varying power spectrum analysis. The procedure adaptively partitions a time series into an unknown number of approximately stationary segments, where some spectral components may remain unchanged across segments, allowing components to evolve differently over time. Local spectra within segments are fit through Whittle likelihood-based penalized spline models of modified Cholesky components, which provide flexible nonparametric estimates that preserve positive definite structures of spectral matrices. The approach is formulated in a Bayesian framework, in which the number and location of partitions are random, and relies on reversible jump Markov chain and Hamiltonian Monte Carlo methods that can adapt to the unknown number of segments and parameters. By averaging over the distribution of partitions, the approach can approximate both abrupt and slowly varying changes in spectral matrices. Empirical performance is evaluated in simulation studies and illustrated through analyses of electroencephalography during sleep and of the El Niño-Southern Oscillation. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 453-465 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1415908 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1415908 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:453-465 Template-Type: ReDIF-Article 1.0 Author-Name: Marco Di Marzio Author-X-Name-First: Marco Author-X-Name-Last: Di Marzio Author-Name: Agnese Panzera Author-X-Name-First: Agnese Author-X-Name-Last: Panzera Author-Name: Charles C. Taylor Author-X-Name-First: Charles C. Author-X-Name-Last: Taylor Title: Nonparametric Rotations for Sphere-Sphere Regression Abstract: Regression of data represented as points on a hypersphere has traditionally been treated using parametric families of transformations that include the simple rigid rotation as an important special case. On the other hand, nonparametric methods have generally focused on modeling a scalar response through a spherical predictor by representing the regression function as a polynomial, leading to component-wise estimation of a spherical response. We propose a very flexible, simple regression model where for each location of the manifold a specific rotation matrix is to be estimated. 
To make this approach tractable, we assume continuity of the regression function that, in turn, allows for approximations of rotation matrices based on a series expansion. It is seen that the nonrigidity of our technique motivates an iterative estimation within a Newton–Raphson learning scheme, which exhibits bias reduction properties. Extensions to general shape matching are also outlined. Both simulations and real data are used to illustrate the results. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 466-476 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2017.1421542 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1421542 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:466-476 Template-Type: ReDIF-Article 1.0 Author-Name: Yang Ni Author-X-Name-First: Yang Author-X-Name-Last: Ni Author-Name: Francesco C. Stingo Author-X-Name-First: Francesco C. Author-X-Name-Last: Stingo Author-Name: Min Jin Ha Author-X-Name-First: Min Jin Author-X-Name-Last: Ha Author-Name: Rehan Akbani Author-X-Name-First: Rehan Author-X-Name-Last: Akbani Author-Name: Veerabhadran Baladandayuthapani Author-X-Name-First: Veerabhadran Author-X-Name-Last: Baladandayuthapani Title: Bayesian Hierarchical Varying-Sparsity Regression Models with Application to Cancer Proteogenomics Abstract: Identifying patient-specific prognostic biomarkers is of critical importance in developing personalized treatment for clinically and molecularly heterogeneous diseases such as cancer. In this article, we propose a novel regression framework, Bayesian hierarchical varying-sparsity regression (BEHAVIOR) models, to select clinically relevant disease markers by integrating proteogenomic (proteomic+genomic) and clinical data. Our methods allow flexible modeling of protein–gene relationships as well as induce sparsity in both protein–gene and protein–survival relationships, to select genomically driven prognostic protein markers at the patient level. Simulation studies demonstrate the superior performance of BEHAVIOR against competing methods in terms of both protein marker selection and survival prediction. We apply BEHAVIOR to The Cancer Genome Atlas (TCGA) proteogenomic pan-cancer data and find several interesting prognostic proteins and pathways that are shared across multiple cancers and some that exclusively pertain to specific cancers. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available online. Journal: Journal of the American Statistical Association Pages: 48-60 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2018.1434529 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1434529 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:48-60 Template-Type: ReDIF-Article 1.0 Author-Name: Mark D. Risser Author-X-Name-First: Mark D. Author-X-Name-Last: Risser Author-Name: Christopher J. Paciorek Author-X-Name-First: Christopher J. Author-X-Name-Last: Paciorek Author-Name: Dáithí A. Stone Author-X-Name-First: Dáithí A. 
Author-X-Name-Last: Stone Title: Spatially Dependent Multiple Testing Under Model Misspecification, With Application to Detection of Anthropogenic Influence on Extreme Climate Events Abstract: The Weather Risk Attribution Forecast (WRAF) is a forecasting tool that uses output from global climate models to make simultaneous attribution statements about whether and how greenhouse gas emissions have contributed to extreme weather across the globe. However, in conducting a large number of simultaneous hypothesis tests, the WRAF is prone to identifying false “discoveries.” A common technique for addressing this multiple testing problem is to adjust the procedure in a way that controls the proportion of true null hypotheses that are incorrectly rejected, or the false discovery rate (FDR). Unfortunately, generic FDR procedures suffer from low power when the hypotheses are dependent, and techniques designed to account for dependence are sensitive to misspecification of the underlying statistical model. In this article, we develop a Bayesian decision-theoretical approach for dependent multiple testing and a nonparametric hierarchical statistical model that flexibly controls false discovery and is robust to model misspecification. We illustrate the robustness of our procedure to model error with a simulation study, using a framework that accounts for generic spatial dependence and allows the practitioner to flexibly specify the decision criteria. Finally, we apply our procedure to several seasonal forecasts and discuss implementation for the WRAF workflow. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 61-78 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2018.1451335 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1451335 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:61-78 Template-Type: ReDIF-Article 1.0 Author-Name: Kwonsang Lee Author-X-Name-First: Kwonsang Author-X-Name-Last: Lee Author-Name: Dylan S. Small Author-X-Name-First: Dylan S. Author-X-Name-Last: Small Title: Estimating the Malaria Attributable Fever Fraction Accounting for Parasites Being Killed by Fever and Measurement Error Abstract: Malaria is a major health problem in many tropical regions. Fever is a characteristic symptom of malaria. The fraction of fevers that are attributable to malaria, the malaria attributable fever fraction (MAFF), is an important public health measure in that the MAFF can be used to calculate the number of fevers that would be avoided if malaria was eliminated. Despite such causal interpretation, the MAFF has not been considered in the framework of causal inference. We define the MAFF using the potential outcome framework, and define causal assumptions that current estimation methods rely on. Furthermore, we demonstrate that one of the assumptions—that the parasite density is correctly measured—generally does not hold because (i) fever kills some parasites and (ii) parasite density is measured with error. In the presence of these problems, we reveal that current MAFF estimators can be significantly biased. To develop a consistent estimator, we propose a novel maximum likelihood estimation method based on exponential family g-modeling. 
Under the assumption that the measurement error mechanism and the magnitude of the fever killing effect are known, we show that our proposed method provides approximately unbiased estimates of the MAFF in simulation studies. A sensitivity analysis is developed to assess the impact of different magnitudes of fever killing and different measurement error mechanisms. Finally, we apply our proposed method to estimate the MAFF in Kilombero, Tanzania. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 79-92 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2018.1469989 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1469989 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:79-92 Template-Type: ReDIF-Article 1.0 Author-Name: Edward H. Kennedy Author-X-Name-First: Edward H. Author-X-Name-Last: Kennedy Author-Name: Steve Harris Author-X-Name-First: Steve Author-X-Name-Last: Harris Author-Name: Luke J. Keele Author-X-Name-First: Luke J. Author-X-Name-Last: Keele Title: Survivor-Complier Effects in the Presence of Selection on Treatment, With Application to a Study of Prompt ICU Admission Abstract: Pretreatment selection or censoring (“selection on treatment”) can occur when two treatment levels are compared ignoring the third option of neither treatment, in “censoring by death” settings where treatment is only defined for those who survive long enough to receive it, or in general in studies where the treatment is only defined for a subset of the population. Unfortunately, the standard instrumental variable (IV) estimand is not defined in the presence of such selection, so we consider estimating a new survivor-complier causal effect. Although this effect is generally not identified under standard IV assumptions, it is possible to construct sharp bounds. We derive these bounds and give a corresponding data-driven sensitivity analysis, along with nonparametric yet efficient estimation methods. Importantly, our approach allows for high-dimensional confounding adjustment, and valid inference even after employing machine learning. Incorporating covariates can tighten bounds dramatically, especially when they are strong predictors of the selection process. We apply the methods in a UK cohort study of critical care patients to examine the mortality effects of prompt admission to the intensive care unit, using ICU bed availability as an instrument. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 93-104 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2018.1469990 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1469990 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:93-104 Template-Type: ReDIF-Article 1.0 Author-Name: Mamadou Yauck Author-X-Name-First: Mamadou Author-X-Name-Last: Yauck Author-Name: Louis-Paul Rivest Author-X-Name-First: Louis-Paul Author-X-Name-Last: Rivest Author-Name: Greg Rothman Author-X-Name-First: Greg Author-X-Name-Last: Rothman Title: Capture-Recapture Methods for Data on the Activation of Applications on Mobile Phones Abstract: This work is concerned with the analysis of marketing data on the activation of applications (apps) on mobile devices. 
Each application has a hashed identification number that is specific to the device on which it has been installed. This number can be registered by a platform at each activation of the application. Activations on the same device are linked together using the identification number. By focusing on activations that took place at a business location, one can create a capture-recapture dataset about devices, that is, users, that “visited” the business: the units are owners of mobile devices and the capture occasions are time intervals such as days. A unit is captured when she activates an application, provided that this activation is recorded by the platform providing the data. Statistical capture-recapture techniques can be applied to the app data to estimate the total number of users that visited the business over a time period, thereby providing an indirect estimate of foot traffic. This article argues that the robust design, a method for dealing with a nested mark-recapture experiment, can be used in this context. A new algorithm for estimating the parameters of a robust design with a fairly large number of capture occasions and a simple parametric bootstrap variance estimator are proposed. Moreover, new estimation methods and new theoretical results are introduced for a wider application of the robust design. This is used to analyze a dataset about the mobile devices that visited the auto-dealerships of a major auto brand in a U.S. metropolitan area over a period of one and a half years. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 105-114 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2018.1469991 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1469991 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:105-114 Template-Type: ReDIF-Article 1.0 Author-Name: Anna Louise Schröder Author-X-Name-First: Anna Louise Author-X-Name-Last: Schröder Author-Name: Hernando Ombao Author-X-Name-First: Hernando Author-X-Name-Last: Ombao Title: FreSpeD: Frequency-Specific Change-Point Detection in Epileptic Seizure Multi-Channel EEG Data Abstract: The goal in this article is to develop a practical tool that identifies changes in brain activity as recorded in electroencephalograms (EEG). Our method is devised to detect possibly subtle disruptions in normal brain functioning that precede the onset of an epileptic seizure. Moreover, it is able to capture the evolution of seizure spread from one region (or channel) to another. The proposed frequency-specific change-point detection method (FreSpeD) deploys a cumulative sum-type test statistic within a binary segmentation algorithm. We demonstrate the theoretical properties of FreSpeD and show its robustness to parameter choice and advantages against two competing methods. Furthermore, the FreSpeD method produces directly interpretable output. When applied to epileptic seizure EEG data, FreSpeD identifies the correct brain region as the focal point of seizure and the timing of the seizure onset. Moreover, FreSpeD detects changes in cross-coherence immediately before seizure onset which indicate an evolution leading up to the seizure. These changes are subtle and were not captured by the methods that previously analyzed the same EEG data. 
Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 115-128 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2018.1476238 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1476238 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:115-128 Template-Type: ReDIF-Article 1.0 Author-Name: Li Li Author-X-Name-First: Li Author-X-Name-Last: Li Author-Name: Alejandro Jara Author-X-Name-First: Alejandro Author-X-Name-Last: Jara Author-Name: María José García-Zattera Author-X-Name-First: María José Author-X-Name-Last: García-Zattera Author-Name: Timothy E. Hanson Author-X-Name-First: Timothy E. Author-X-Name-Last: Hanson Title: Marginal Bayesian Semiparametric Modeling of Mismeasured Multivariate Interval-Censored Data Abstract: Motivated by data gathered in an oral health study, we propose a Bayesian nonparametric approach for population-averaged modeling of correlated time-to-event data, when the responses can only be determined to lie in an interval obtained from a sequence of examination times and the determination of the occurrence of the event is subject to misclassification. The joint model for the true, unobserved time-to-event data is defined semiparametrically; proportional hazards, proportional odds, and accelerated failure time (proportional quantiles) are all fit and compared. The baseline distribution is modeled as a flexible tailfree prior. The joint model is completed by considering a parametric copula function. A general misclassification model is discussed in detail, considering the possibility that different examiners were involved in the assessment of the occurrence of the events for a given subject across time. We provide empirical evidence that the model can be used to estimate the underlying time-to-event distribution and the misclassification parameters without any external information about the latter parameters. We also illustrate the effect on the statistical inferences of neglecting the presence of misclassification. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 129-145 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2018.1476240 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1476240 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:129-145 Template-Type: ReDIF-Article 1.0 Author-Name: Tingting Zhou Author-X-Name-First: Tingting Author-X-Name-Last: Zhou Author-Name: Michael R. Elliott Author-X-Name-First: Michael R. Author-X-Name-Last: Elliott Author-Name: Roderick J. A. Little Author-X-Name-First: Roderick J. A. Author-X-Name-Last: Little Title: Penalized Spline of Propensity Methods for Treatment Comparison Abstract: Valid causal inference from observational studies requires controlling for confounders. When time-dependent confounders are present that serve as mediators of treatment effects and affect future treatment assignment, standard regression methods for controlling for confounders fail. Similar issues also arise in trials with sequential randomization, when randomization at later time points is based on intermediate outcomes from earlier randomized assignments. 
We propose a robust multiple imputation-based approach to causal inference in this setting called penalized spline of propensity methods for treatment comparison (PENCOMP), which builds on the penalized spline of propensity prediction method for missing data problems. PENCOMP estimates causal effects by imputing missing potential outcomes with flexible spline models and draws inference based on imputed and observed outcomes. Under the SUTVA, positivity, and ignorability assumptions, PENCOMP has a double robustness property for causal effects. Simulations suggest that it tends to outperform doubly robust marginal structural modeling when the weights are variable. We apply our method to the multicenter AIDS cohort study to estimate the effect of antiretroviral treatment on CD4 counts in HIV-infected patients. Supplementary materials for this article are available online. Code submitted with this article was checked by an Associate Editor for Reproducibility and is available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1-19 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2018.1518234 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1518234 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:1-19 Template-Type: ReDIF-Article 1.0 Author-Name: Andrew J. Spieker Author-X-Name-First: Andrew J. Author-X-Name-Last: Spieker Title: Comment on Penalized Spline of Propensity Methods for Treatment Comparison by Zhou, Elliott, and Little Journal: Journal of the American Statistical Association Pages: 20-23 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2018.1537913 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537913 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:20-23 Template-Type: ReDIF-Article 1.0 Author-Name: Joseph Antonelli Author-X-Name-First: Joseph Author-X-Name-Last: Antonelli Author-Name: Michael J. Daniels Author-X-Name-First: Michael J. Author-X-Name-Last: Daniels Title: Discussion of PENCOMP Journal: Journal of the American Statistical Association Pages: 24-27 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2018.1537914 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537914 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:24-27 Template-Type: ReDIF-Article 1.0 Author-Name: Qingxia Chen Author-X-Name-First: Qingxia Author-X-Name-Last: Chen Author-Name: Frank E. Harrell Author-X-Name-First: Frank E. Author-X-Name-Last: Harrell Title: Comment: Penalized Spline of Propensity Methods for Treatment Comparison Journal: Journal of the American Statistical Association Pages: 28-30 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2018.1537915 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537915 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:28-30 Template-Type: ReDIF-Article 1.0 Author-Name: Shu Yang Author-X-Name-First: Shu Author-X-Name-Last: Yang Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Title: Discussion of “Penalized Spline of Propensity Methods for Treatment Comparison” by Zhou, Elliott, and Little Journal: Journal of the American Statistical Association Pages: 30-32 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2018.1537916 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537916 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:30-32 Template-Type: ReDIF-Article 1.0 Author-Name: Georgia Papadogeorgou Author-X-Name-First: Georgia Author-X-Name-Last: Papadogeorgou Author-Name: Fan Li Author-X-Name-First: Fan Author-X-Name-Last: Li Title: Discussion of “Penalized Spline of Propensity Methods for Treatment Comparison” Journal: Journal of the American Statistical Association Pages: 32-35 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2018.1543120 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1543120 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:32-35 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Corrigendum Journal: Journal of the American Statistical Association Pages: 484-484 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2018.1548858 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1548858 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:484-484 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Correction Journal: Journal of the American Statistical Association Pages: 485-485 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2018.1548859 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1548859 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:485-485 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Correction Journal: Journal of the American Statistical Association Pages: 486-486 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2018.1548861 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1548861 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:486-486 Template-Type: ReDIF-Article 1.0 Author-Name: Tingting Zhou Author-X-Name-First: Tingting Author-X-Name-Last: Zhou Author-Name: Michael R. Elliott Author-X-Name-First: Michael R. Author-X-Name-Last: Elliott Author-Name: Roderick J. A. Little Author-X-Name-First: Roderick J. A. Author-X-Name-Last: Little Title: Penalized Spline of Propensity Methods for Treatment Comparison: Rejoinder Journal: Journal of the American Statistical Association Pages: 35-38 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2019.1576439 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1576439 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:35-38 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Editorial Collaborators Journal: Journal of the American Statistical Association Pages: 487-494 Issue: 525 Volume: 114 Year: 2019 Month: 1 X-DOI: 10.1080/01621459.2019.1583915 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1583915 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:487-494 Template-Type: ReDIF-Article 1.0 Author-Name: Lisa Morrissey LaVange Author-X-Name-First: Lisa Morrissey Author-X-Name-Last: LaVange Title: Choose to Lead Abstract: Each year, the Journal of the American Statistical Association publishes the presidential address from the Joint Statistical Meetings. Here we present the 2018 address verbatim save for the addition of references and a few minor editorial corrections. Journal: Journal of the American Statistical Association Pages: 1427-1435 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2019.1661183 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1661183 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1427-1435 Template-Type: ReDIF-Article 1.0 Author-Name: Christopher Jackson Author-X-Name-First: Christopher Author-X-Name-Last: Jackson Author-Name: Anne Presanis Author-X-Name-First: Anne Author-X-Name-Last: Presanis Author-Name: Stefano Conti Author-X-Name-First: Stefano Author-X-Name-Last: Conti Author-Name: Daniela De Angelis Author-X-Name-First: Daniela Author-X-Name-Last: De Angelis Title: Value of Information: Sensitivity Analysis and Research Design in Bayesian Evidence Synthesis Abstract: Suppose we have a Bayesian model that combines evidence from several different sources. We want to know which model parameters most affect the estimate or decision from the model, or which of the parameter uncertainties drive the decision uncertainty. Furthermore, we want to prioritize what further data should be collected. These questions can be addressed by Value of Information (VoI) analysis, in which we estimate expected reductions in loss from learning specific parameters or collecting data of a given design. We describe the theory and practice of VoI for Bayesian evidence synthesis, using and extending ideas from health economics, computer modeling and Bayesian design. The methods are general to a range of decision problems including point estimation and choices between discrete actions. We apply them to a model for estimating prevalence of HIV infection, combining indirect information from surveys, registers, and expert beliefs. This analysis shows which parameters contribute most of the uncertainty about each prevalence estimate, and the expected improvements in precision from specific amounts of additional data. These benefits can be traded with the costs of sampling to determine an optimal sample size. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. 
Journal: Journal of the American Statistical Association Pages: 1436-1449 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1562932 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1562932 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1436-1449 Template-Type: ReDIF-Article 1.0 Author-Name: Devin Francom Author-X-Name-First: Devin Author-X-Name-Last: Francom Author-Name: Bruno Sansó Author-X-Name-First: Bruno Author-X-Name-Last: Sansó Author-Name: Vera Bulaevskaya Author-X-Name-First: Vera Author-X-Name-Last: Bulaevskaya Author-Name: Donald Lucas Author-X-Name-First: Donald Author-X-Name-Last: Lucas Author-Name: Matthew Simpson Author-X-Name-First: Matthew Author-X-Name-Last: Simpson Title: Inferring Atmospheric Release Characteristics in a Large Computer Experiment Using Bayesian Adaptive Splines Abstract: An atmospheric release of hazardous material, whether accidental or intentional, can be catastrophic for those in the path of the plume. Predicting the path of a plume based on characteristics of the release (location, amount, and duration) and meteorological conditions is an active research area highly relevant for emergency and long-term response to these releases. As a result, researchers have developed particle dispersion simulators to provide plume path predictions that incorporate release characteristics and meteorological conditions. However, since release characteristics and meteorological conditions are often unknown, the inverse problem is of great interest, that is, based on all the observations of the plume so far, what can be inferred about the release characteristics? This is the question we seek to answer using plume observations from a controlled release at the Diablo Canyon Nuclear Power Plant in Central California. With access to a large number of evaluations of a computationally expensive particle dispersion simulator that includes continuous and categorical inputs and spatio-temporal output, building a fast statistical surrogate model (or emulator) presents many statistical challenges, but is an essential tool for inverse modeling and sensitivity analysis. We achieve accurate emulation using Bayesian adaptive splines to model weights on empirical orthogonal functions. We use this emulator as well as appropriately identifiable simulator discrepancy and observational error models to calibrate the simulator, thus finding a posterior distribution of characteristics of the release. Since the release was controlled, these characteristics are known, making it possible to compare our findings to the truth. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1450-1465 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1562933 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1562933 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1450-1465 Template-Type: ReDIF-Article 1.0 Author-Name: Yimeng Xie Author-X-Name-First: Yimeng Author-X-Name-Last: Xie Author-Name: Li Xu Author-X-Name-First: Li Author-X-Name-Last: Xu Author-Name: Jie Li Author-X-Name-First: Jie Author-X-Name-Last: Li Author-Name: Xinwei Deng Author-X-Name-First: Xinwei Author-X-Name-Last: Deng Author-Name: Yili Hong Author-X-Name-First: Yili Author-X-Name-Last: Hong Author-Name: Korine Kolivras Author-X-Name-First: Korine Author-X-Name-Last: Kolivras Author-Name: David N. Gaines Author-X-Name-First: David N. Author-X-Name-Last: Gaines Title: Spatial Variable Selection and An Application to Virginia Lyme Disease Emergence Abstract: Lyme disease is an infectious disease caused by the bacterium Borrelia burgdorferi sensu stricto. In the United States, Lyme disease is one of the most common infectious diseases. The major endemic areas of the disease are New England, Mid-Atlantic, East-North Central, South Atlantic, and West North-Central. Virginia is on the front line of the disease’s diffusion from the northeast to the south. One of the research objectives for the infectious disease community is to identify environmental and economic variables that are associated with the emergence of Lyme disease. In this article, we use a spatial Poisson regression model to link the spatial disease counts and environmental and economic variables, and develop a spatial variable selection procedure to effectively identify important factors by using an adaptive elastic net penalty. The proposed methods can automatically select important covariates, while adjusting for possible spatial correlations of disease counts. The performance of the proposed method is studied and compared with existing methods via a comprehensive simulation study. We apply the developed variable selection methods to the Virginia Lyme disease data and identify important variables that are new to the literature. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1466-1480 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1564670 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1564670 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1466-1480 Template-Type: ReDIF-Article 1.0 Author-Name: Oliver Stoner Author-X-Name-First: Oliver Author-X-Name-Last: Stoner Author-Name: Theo Economou Author-X-Name-First: Theo Author-X-Name-Last: Economou Author-Name: Gabriela Drummond Marques da Silva Author-X-Name-First: Gabriela Author-X-Name-Last: Drummond Marques da Silva Title: A Hierarchical Framework for Correcting Under-Reporting in Count Data Abstract: Tuberculosis poses a global health risk and Brazil is among the top 20 countries by absolute mortality. However, this epidemiological burden is masked by under-reporting, which impairs planning for effective intervention. We present a comprehensive investigation and application of a Bayesian hierarchical approach to modeling and correcting under-reporting in tuberculosis counts, a general problem arising in observational count data. 
The framework is applicable to fully under-reported data, relying only on an informative prior distribution for the mean reporting rate to supplement the partial information in the data. Covariates are used to inform both the true count-generating process and the under-reporting mechanism, while also allowing for complex spatio-temporal structures. We present several sensitivity analyses based on simulation experiments to aid the elicitation of the prior distribution for the mean reporting rate and decisions relating to the inclusion of covariates. Both prior and posterior predictive model checking are presented, as well as a critical evaluation of the approach. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1481-1492 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2019.1573732 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1573732 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1481-1492 Template-Type: ReDIF-Article 1.0 Author-Name: Bradley C. Saul Author-X-Name-First: Bradley C. Author-X-Name-Last: Saul Author-Name: Michael G. Hudgens Author-X-Name-First: Michael G. Author-X-Name-Last: Hudgens Author-Name: Michael A. Mallin Author-X-Name-First: Michael A. Author-X-Name-Last: Mallin Title: Downstream Effects of Upstream Causes Abstract: The United States Environmental Protection Agency considers nutrient pollution in stream ecosystems one of the United States’ most pressing environmental challenges. But limited independent replicates, lack of experimental randomization, and space- and time-varying confounding handicap causal inference on effects of nutrient pollution. In this article, the causal g-methods are extended to allow for exposures to vary in time and space in order to assess the effects of nutrient pollution on chlorophyll a—a proxy for algal production. Publicly available data from North Carolina’s Cape Fear River and a simulation study are used to show how causal effects of upstream nutrient concentrations on downstream chlorophyll a levels may be estimated from typical water quality monitoring data. Estimates obtained from the parametric g-formula, a marginal structural model, and a structural nested model indicate that chlorophyll a concentrations at Lock and Dam 1 were influenced by nitrate concentrations measured 86 to 109 km upstream, an area where four major industrial and municipal point sources discharge wastewater. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1493-1504 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2019.1574226 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1574226 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1493-1504 Template-Type: ReDIF-Article 1.0 Author-Name: Zhengwu Zhang Author-X-Name-First: Zhengwu Author-X-Name-Last: Zhang Author-Name: Maxime Descoteaux Author-X-Name-First: Maxime Author-X-Name-Last: Descoteaux Author-Name: David B. Dunson Author-X-Name-First: David B. 
Author-X-Name-Last: Dunson Title: Nonparametric Bayes Models of Fiber Curves Connecting Brain Regions Abstract: In studying structural inter-connections in the human brain, it is common to first estimate fiber bundles connecting different regions relying on diffusion MRI. These fiber bundles act as highways for neural activity. Current statistical methods reduce the rich information into an adjacency matrix, with the elements containing a count of fibers or a mean diffusion feature along the fibers. The goal of this article is to avoid discarding the rich geometric information of fibers, developing flexible models for characterizing the population distribution of fibers between brain regions of interest within and across different individuals. We start by decomposing each fiber into a rotation matrix, shape and translation from a global reference curve. These components are viewed as data lying on a product space composed of different Euclidean spaces and manifolds. To nonparametrically model the distribution within and across individuals, we rely on a hierarchical mixture of product kernels specific to the component spaces. Taking a Bayesian approach to inference, we develop efficient methods for posterior sampling. The approach automatically produces clusters of fibers within and across individuals. Applying the method to Human Connectome Project data, we find interesting relationships between brain fiber geometry and reading ability. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1505-1517 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2019.1574582 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1574582 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1505-1517 Template-Type: ReDIF-Article 1.0 Author-Name: Chris J. Oates Author-X-Name-First: Chris J. Author-X-Name-Last: Oates Author-Name: Jon Cockayne Author-X-Name-First: Jon Author-X-Name-Last: Cockayne Author-Name: Robert G. Aykroyd Author-X-Name-First: Robert G. Author-X-Name-Last: Aykroyd Author-Name: Mark Girolami Author-X-Name-First: Mark Author-X-Name-Last: Girolami Title: Bayesian Probabilistic Numerical Methods in Time-Dependent State Estimation for Industrial Hydrocyclone Equipment Abstract: The use of high-power industrial equipment, such as large-scale mixing equipment or a hydrocyclone for separation of particles in liquid suspension, demands careful monitoring to ensure correct operation. The fundamental task of state-estimation for the liquid suspension can be posed as a time-evolving inverse problem and solved with Bayesian statistical methods. In this article, we extend Bayesian methods to incorporate statistical models for the error that is incurred in the numerical solution of the physical governing equations. This enables full uncertainty quantification within a principled computation-precision trade-off, in contrast to the over-confident inferences that are obtained when all sources of numerical error are ignored. The method is cast within a sequential Monte Carlo framework and an optimized implementation is provided in Python. 
Journal: Journal of the American Statistical Association Pages: 1518-1531 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2019.1574583 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1574583 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1518-1531 Template-Type: ReDIF-Article 1.0 Author-Name: Dean Knox Author-X-Name-First: Dean Author-X-Name-Last: Knox Author-Name: Teppei Yamamoto Author-X-Name-First: Teppei Author-X-Name-Last: Yamamoto Author-Name: Matthew A. Baum Author-X-Name-First: Matthew A. Author-X-Name-Last: Baum Author-Name: Adam J. Berinsky Author-X-Name-First: Adam J. Author-X-Name-Last: Berinsky Title: Design, Identification, and Sensitivity Analysis for Patient Preference Trials Abstract: Social and medical scientists are often concerned that the external validity of experimental results may be compromised because of heterogeneous treatment effects. If a treatment has different effects on those who would choose to take it and those who would not, the average treatment effect estimated in a standard randomized controlled trial (RCT) may give a misleading picture of its impact outside of the study sample. Patient preference trials (PPTs), where participants’ preferences over treatment options are incorporated in the study design, provide a possible solution. In this paper, we provide a systematic analysis of PPTs based on the potential outcomes framework of causal inference. We propose a general design for PPTs with multi-valued treatments, where participants state their preferred treatments and are then randomized into either a standard RCT or a self-selection condition. We derive nonparametric sharp bounds on the average causal effects among each choice-based subpopulation of participants under the proposed design. We also propose a sensitivity analysis for the violation of the key ignorability assumption sufficient for identifying the target causal quantity. The proposed design and methodology are illustrated with an original study of partisan news media and its behavioral impact. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1532-1546 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2019.1585248 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1585248 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1532-1546 Template-Type: ReDIF-Article 1.0 Author-Name: J. L. Scealy Author-X-Name-First: J. L. Author-X-Name-Last: Scealy Author-Name: Andrew T. A. Wood Author-X-Name-First: Andrew T. A. Author-X-Name-Last: Wood Title: Scaled von Mises–Fisher Distributions and Regression Models for Paleomagnetic Directional Data Abstract: We propose a new distribution for analyzing paleomagnetic directional data, that is, a novel transformation of the von Mises–Fisher distribution. The new distribution has ellipse-like symmetry, as does the Kent distribution; however, unlike the Kent distribution the normalizing constant in the new density is easy to compute and estimation of the shape parameters is straightforward. To accommodate outliers, the model also incorporates an additional shape parameter, which controls the tail-weight of the distribution. 
We also develop a general regression model framework that allows both the mean direction and the shape parameters of the error distribution to depend on covariates. The proposed regression procedure is shown to be equivariant with respect to the choice of coordinate system for the directional response. To illustrate, we analyze paleomagnetic directional data from the GEOMAGIA50.v3 database. We predict the mean direction at various geological time points and show that there is significant heteroscedasticity present. It is envisaged that the regression structures and error distribution proposed here will also prove useful when covariate information is available with (i) other types of directional response data; and (ii) square-root transformed compositional data of general dimension. Supplementary materials for this article are available online. Code submitted with this article was checked by an Associate Editor for Reproducibility and is available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1547-1560 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2019.1585249 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1585249 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1547-1560 Template-Type: ReDIF-Article 1.0 Author-Name: Xueying Tang Author-X-Name-First: Xueying Author-X-Name-Last: Tang Author-Name: Yang Yang Author-X-Name-First: Yang Author-X-Name-Last: Yang Author-Name: Hong-Jie Yu Author-X-Name-First: Hong-Jie Author-X-Name-Last: Yu Author-Name: Qiao-Hong Liao Author-X-Name-First: Qiao-Hong Author-X-Name-Last: Liao Author-Name: Nikolay Bliznyuk Author-X-Name-First: Nikolay Author-X-Name-Last: Bliznyuk Title: A Spatio-Temporal Modeling Framework for Surveillance Data of Multiple Infectious Pathogens With Small Laboratory Validation Sets Abstract: Many surveillance systems of infectious diseases are syndrome-based, capturing patients by clinical manifestation. Only a fraction of patients, mostly severe cases, undergo laboratory validation to identify the underlying pathogen. Motivated by the need to understand transmission dynamics and associated risk factors of enteroviruses causing hand, foot, and mouth disease (HFMD) in China, we developed a Bayesian spatio-temporal modeling framework for surveillance data of infectious diseases with small validation sets. A novel approach was proposed to sample unobserved pathogen-specific patient counts over space and time and was compared to an existing sampling approach. The practical utility of this framework in identifying key parameters was assessed in simulations for a range of realistic sizes of the validation set. Several designs of sampling patients for laboratory validation were compared with and without aggregation of sparse validation data. The methodology was applied to the 2009 HFMD epidemic in southern China to evaluate transmissibility and the effects of climatic conditions for the leading pathogens of the disease, enterovirus 71 and Coxsackie A16. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1561-1573 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2019.1585250 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1585250 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1561-1573 Template-Type: ReDIF-Article 1.0 Author-Name: Yixin Wang Author-X-Name-First: Yixin Author-X-Name-Last: Wang Author-Name: David M. Blei Author-X-Name-First: David M. Author-X-Name-Last: Blei Title: The Blessings of Multiple Causes Abstract: Causal inference from observational data is a vital problem, but it comes with strong assumptions. Most methods assume that we observe all confounders, variables that affect both the causal variables and the outcome variables. This assumption is standard but it is also untestable. In this article, we develop the deconfounder, a way to do causal inference with weaker assumptions than the traditional methods require. The deconfounder is designed for problems of multiple causal inference: scientific studies that involve multiple causes whose effects are simultaneously of interest. Specifically, the deconfounder combines unsupervised machine learning and predictive model checking to use the dependencies among multiple causes as indirect evidence for some of the unobserved confounders. We develop the deconfounder algorithm, prove that it is unbiased, and show that it requires weaker assumptions than traditional causal inference. We analyze its performance in three types of studies: semi-simulated data around smoking and lung cancer, semi-simulated data around genome-wide association studies, and a real dataset about actors and movie revenue. The deconfounder is an effective approach to estimating causal effects in problems of multiple causal inference. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1574-1596 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2019.1686987 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1686987 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1574-1596 Template-Type: ReDIF-Article 1.0 Author-Name: Alexander D’Amour Author-X-Name-First: Alexander Author-X-Name-Last: D’Amour Title: Comment: Reflections on the Deconfounder Journal: Journal of the American Statistical Association Pages: 1597-1601 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2019.1689138 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1689138 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1597-1601 Template-Type: ReDIF-Article 1.0 Author-Name: Susan Athey Author-X-Name-First: Susan Author-X-Name-Last: Athey Author-Name: Guido W. Imbens Author-X-Name-First: Guido W. Author-X-Name-Last: Imbens Author-Name: Michael Pollmann Author-X-Name-First: Michael Author-X-Name-Last: Pollmann Title: Comment on: “The Blessings of Multiple Causes” by Yixin Wang and David M. Blei Journal: Journal of the American Statistical Association Pages: 1602-1604 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2019.1691008 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1691008 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1602-1604 Template-Type: ReDIF-Article 1.0 Author-Name: Kosuke Imai Author-X-Name-First: Kosuke Author-X-Name-Last: Imai Author-Name: Zhichao Jiang Author-X-Name-First: Zhichao Author-X-Name-Last: Jiang Title: Comment: The Challenges of Multiple Causes Abstract: We begin by congratulating Yixin Wang and David Blei for their thought-provoking article that opens up a new research frontier in the field of causal inference. The authors directly tackle the challenging question of how to infer causal effects of many treatments in the presence of unmeasured confounding. We expect their article to have a major impact by further advancing our understanding of this important methodological problem. This commentary has two goals. We first critically review the deconfounder method and point out its advantages and limitations. We then briefly consider three possible ways to address some of the limitations of the deconfounder method. Journal: Journal of the American Statistical Association Pages: 1605-1610 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2019.1689137 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1689137 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1605-1610 Template-Type: ReDIF-Article 1.0 Author-Name: Elizabeth L. Ogburn Author-X-Name-First: Elizabeth L. Author-X-Name-Last: Ogburn Author-Name: Ilya Shpitser Author-X-Name-First: Ilya Author-X-Name-Last: Shpitser Author-Name: Eric J. Tchetgen Tchetgen Author-X-Name-First: Eric J. Tchetgen Author-X-Name-Last: Tchetgen Title: Comment on “Blessings of Multiple Causes” Journal: Journal of the American Statistical Association Pages: 1611-1615 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2019.1689139 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1689139 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1611-1615 Template-Type: ReDIF-Article 1.0 Author-Name: Yixin Wang Author-X-Name-First: Yixin Author-X-Name-Last: Wang Author-Name: David M. Blei Author-X-Name-First: David M. Author-X-Name-Last: Blei Title: The Blessings of Multiple Causes: Rejoinder Journal: Journal of the American Statistical Association Pages: 1616-1619 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2019.1690841 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1690841 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1616-1619 Template-Type: ReDIF-Article 1.0 Author-Name: Kai Zhang Author-X-Name-First: Kai Author-X-Name-Last: Zhang Title: BET on Independence Abstract: We study the problem of nonparametric dependence detection. Many existing methods may suffer severe power loss due to nonuniform consistency, which we illustrate with a paradox. To avoid such power loss, we approach the nonparametric test of independence through the new framework of binary expansion statistics (BEStat) and binary expansion testing (BET), which examine dependence through a novel binary expansion filtration approximation of the copula. Through a Hadamard transform, we find that the symmetry statistics in the filtration are complete sufficient statistics for dependence. These statistics are also uncorrelated under the null. 
By using symmetry statistics, the BET avoids the problem of nonuniform consistency and improves upon a wide class of commonly used methods (a) by achieving the minimax rate in sample size requirement for reliable power and (b) by providing clear interpretations of global relationships upon rejection of independence. The binary expansion approach also connects the symmetry statistics with the current computing system to facilitate efficient bitwise implementation. We illustrate the BET with a study of the distribution of stars in the night sky and with an exploratory data analysis of the TCGA breast cancer data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1620-1637 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1537921 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537921 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1620-1637 Template-Type: ReDIF-Article 1.0 Author-Name: Shubhadeep Chakraborty Author-X-Name-First: Shubhadeep Author-X-Name-Last: Chakraborty Author-Name: Xianyang Zhang Author-X-Name-First: Xianyang Author-X-Name-Last: Zhang Title: Distance Metrics for Measuring Joint Dependence with Application to Causal Inference Abstract: Many statistical applications require the quantification of joint dependence among more than two random vectors. In this work, we generalize the notion of distance covariance to quantify joint dependence among d≥2 random vectors. We introduce the high-order distance covariance to measure the so-called Lancaster interaction dependence. The joint distance covariance is then defined as a linear combination of pairwise distance covariances and their higher-order counterparts which together completely characterize mutual independence. We further introduce some related concepts including the distance cumulant, distance characteristic function, and rank-based distance covariance. Empirical estimators are constructed based on certain Euclidean distances between sample elements. We study the large-sample properties of the estimators and propose a bootstrap procedure to approximate their sampling distributions. The asymptotic validity of the bootstrap procedure is justified under both the null and alternative hypotheses. The new metrics are employed to perform model selection in causal inference, which is based on the joint independence testing of the residuals from the fitted structural equation models. The effectiveness of the method is illustrated via both simulated and real datasets. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1638-1650 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1513364 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1513364 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1638-1650 Template-Type: ReDIF-Article 1.0 Author-Name: Xinran Li Author-X-Name-First: Xinran Author-X-Name-Last: Li Author-Name: Peng Ding Author-X-Name-First: Peng Author-X-Name-Last: Ding Author-Name: Qian Lin Author-X-Name-First: Qian Author-X-Name-Last: Lin Author-Name: Dawei Yang Author-X-Name-First: Dawei Author-X-Name-Last: Yang Author-Name: Jun S. Liu Author-X-Name-First: Jun S. 
Author-X-Name-Last: Liu Title: Randomization Inference for Peer Effects Abstract: Many previous causal inference studies assume no interference, that is, the potential outcomes of a unit do not depend on the treatments of other units. However, this no-interference assumption becomes unreasonable when a unit interacts with other units in the same group or cluster. In a motivating application, a top Chinese university admits students through two channels: the college entrance exam (also known as Gaokao) and recommendation (often based on Olympiads in various subjects). The university randomly assigns students to dorms, each of which hosts four students. Students within the same dorm live together and have extensive interactions. Therefore, it is likely that peer effects exist and the no-interference assumption does not hold. It is important to understand peer effects, because they give useful guidance for future roommate assignment to improve the performance of students. We define peer effects using potential outcomes. We then propose a randomization-based inference framework to study peer effects with arbitrary numbers of peers and peer types. Our inferential procedure does not assume any parametric model on the outcome distribution. Our analysis gives useful practical guidance for policy makers of the university. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1651-1664 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1512863 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1512863 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1651-1664 Template-Type: ReDIF-Article 1.0 Author-Name: Iavor Bojinov Author-X-Name-First: Iavor Author-X-Name-Last: Bojinov Author-Name: Neil Shephard Author-X-Name-First: Neil Author-X-Name-Last: Shephard Title: Time Series Experiments and Causal Estimands: Exact Randomization Tests and Trading Abstract: We define causal estimands for experiments on single time series, extending the potential outcomes framework to deal with temporal data. Our approach allows the estimation of a broad class of these estimands and exact randomization-based p-values for testing causal effects, without imposing stringent assumptions. We further derive a general central limit theorem that can be used to conduct conservative tests and build confidence intervals for causal effects. Finally, we provide three methods for generalizing our approach to multiple units receiving the same class of treatment over time. We test our methodology on simulated “potential autoregressions,” which have a causal interpretation. Our methodology is partially inspired by data from a large number of experiments carried out by a financial company that compared the impact of two different ways of trading equity futures contracts. We use our methodology to make causal statements about their trading methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1665-1682 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1527225 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1527225 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1665-1682 Template-Type: ReDIF-Article 1.0 Author-Name: Daniel Peña Author-X-Name-First: Daniel Author-X-Name-Last: Peña Author-Name: Ezequiel Smucler Author-X-Name-First: Ezequiel Author-X-Name-Last: Smucler Author-Name: Victor J. Yohai Author-X-Name-First: Victor J. Author-X-Name-Last: Yohai Title: Forecasting Multiple Time Series With One-Sided Dynamic Principal Components Abstract: We define one-sided dynamic principal components (ODPC) for time series as linear combinations of the present and past values of the series that minimize the reconstruction mean squared error. Dynamic principal components have usually been defined as functions of past and future values of the series, and they are therefore not appropriate for forecasting purposes. In contrast, we show that the ODPC introduced in this article can be successfully used for forecasting high-dimensional multiple time series. An alternating least-squares algorithm to compute the proposed ODPC is presented. We prove that for stationary and ergodic time series the estimated values converge to their population analogs. We also prove that asymptotically, when both the number of series and the sample size go to infinity, if the data follow a dynamic factor model, the reconstruction obtained with ODPC converges in mean square to the common part of the factor model. The results of a simulation study show that the forecasts obtained with ODPC compare favorably with those obtained using other forecasting methods based on dynamic factor models. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1683-1694 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1520117 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1520117 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1683-1694 Template-Type: ReDIF-Article 1.0 Author-Name: Torben G. Andersen Author-X-Name-First: Torben G. Author-X-Name-Last: Andersen Author-Name: Martin Thyrsgaard Author-X-Name-First: Martin Author-X-Name-Last: Thyrsgaard Author-Name: Viktor Todorov Author-X-Name-First: Viktor Author-X-Name-Last: Todorov Title: Time-Varying Periodicity in Intraday Volatility Abstract: We develop a nonparametric test for whether return volatility exhibits time-varying intraday periodicity using a long time series of high-frequency data. Our null hypothesis, commonly adopted in work on volatility modeling, is that volatility follows a stationary process combined with a constant time-of-day periodic component. We construct time-of-day volatility estimates and studentize the high-frequency returns with these periodic components. If the intraday periodicity is invariant, then the distribution of the studentized returns should be identical across the trading day. Consequently, the test compares the empirical characteristic function of the studentized returns across the trading day. The limit distribution of the test depends on the error in recovering volatility from discrete return data and the empirical process error associated with estimating volatility moments through their sample counterparts. Critical values are computed via easy-to-implement simulation. In an empirical application to S&P 500 index returns, we find strong evidence for variation in the intraday volatility pattern driven in part by the current level of volatility.
When volatility is elevated, the period preceding the market close constitutes a significantly higher fraction of the total daily integrated volatility than during low volatility regimes. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1695-1707 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1512864 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1512864 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1695-1707 Template-Type: ReDIF-Article 1.0 Author-Name: Anru Zhang Author-X-Name-First: Anru Author-X-Name-Last: Zhang Author-Name: Rungang Han Author-X-Name-First: Rungang Author-X-Name-Last: Han Title: Optimal Sparse Singular Value Decomposition for High-Dimensional High-Order Data Abstract: In this article, we consider the sparse tensor singular value decomposition, which aims for dimension reduction on high-dimensional high-order data with certain sparsity structure. A method named sparse tensor alternating thresholding for singular value decomposition (STAT-SVD) is proposed. The proposed procedure features a novel double projection & thresholding scheme, which provides a sharp criterion for thresholding in each iteration. Compared with the regular tensor SVD model, STAT-SVD permits more robust estimation under weaker assumptions. Both the upper and lower bounds for estimation accuracy are developed. The proposed procedure is shown to be minimax rate-optimal in a general class of situations. Simulation studies show that STAT-SVD performs well under a variety of configurations. We also illustrate the merits of the proposed procedure on a longitudinal tensor dataset on European country mortality rates. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1708-1725 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1527227 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1527227 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1708-1725 Template-Type: ReDIF-Article 1.0 Author-Name: Qian Lin Author-X-Name-First: Qian Author-X-Name-Last: Lin Author-Name: Zhigen Zhao Author-X-Name-First: Zhigen Author-X-Name-Last: Zhao Author-Name: Jun S. Liu Author-X-Name-First: Jun S. Author-X-Name-Last: Liu Title: Sparse Sliced Inverse Regression via Lasso Abstract: For multiple index models, it has recently been shown that the sliced inverse regression (SIR) is consistent for estimating the sufficient dimension reduction (SDR) space if and only if ρ = lim p/n = 0, where p is the dimension and n is the sample size. Thus, when p is of the same or a higher order than n, additional assumptions such as sparsity must be imposed in order to ensure consistency for SIR. By constructing artificial response variables made up from top eigenvectors of the estimated conditional covariance matrix, we introduce a simple Lasso regression method to obtain an estimate of the SDR space. The resulting algorithm, Lasso-SIR, is shown to be consistent and achieves the optimal convergence rate under certain sparsity conditions when p is of order o(n²λ²), where λ is the generalized signal-to-noise ratio. We also demonstrate the superior performance of Lasso-SIR compared with existing approaches via extensive numerical studies and several real data examples.
Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1726-1739 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1520115 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1520115 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1726-1739 Template-Type: ReDIF-Article 1.0 Author-Name: Efang Kong Author-X-Name-First: Efang Author-X-Name-Last: Kong Author-Name: Yingcun Xia Author-X-Name-First: Yingcun Author-X-Name-Last: Xia Author-Name: Wei Zhong Author-X-Name-First: Wei Author-X-Name-Last: Zhong Title: Composite Coefficient of Determination and Its Application in Ultrahigh Dimensional Variable Screening Abstract: In this article, we propose to measure the dependence between two random variables through a composite coefficient of determination (CCD) of a set of nonparametric regressions. These regressions take consecutive binarizations of one variable as the response and the other variable as the predictor. The resulting measure is invariant to monotonic marginal variable transformation, rendering it robust against heavy-tailed distributions and outliers, and convenient for independence testing. Estimation of CCD can be done through kernel smoothing, with a root-n rate of consistency. CCD is a natural measure of the importance of variables in regression, and its sure screening property, when used for variable screening, is also established. Comprehensive simulation studies and real data analyses show that the newly proposed measure often outperforms other existing methods in both independence testing and variable screening. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1740-1751 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1514305 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1514305 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1740-1751 Template-Type: ReDIF-Article 1.0 Author-Name: Weixin Yao Author-X-Name-First: Weixin Author-X-Name-Last: Yao Author-Name: Debmalya Nandy Author-X-Name-First: Debmalya Author-X-Name-Last: Nandy Author-Name: Bruce G. Lindsay Author-X-Name-First: Bruce G. Author-X-Name-Last: Lindsay Author-Name: Francesca Chiaromonte Author-X-Name-First: Francesca Author-X-Name-Last: Chiaromonte Title: Covariate Information Matrix for Sufficient Dimension Reduction Abstract: Building upon recent research on the applications of the density information matrix, we develop a tool for sufficient dimension reduction (SDR) in regression problems called the covariate information matrix (CIM). CIM exhaustively identifies the central subspace (CS) and provides a rank ordering of the reduced covariates in terms of their regression information. Compared to other popular SDR methods, CIM does not require distributional assumptions on the covariates or estimation of the mean regression function. CIM is implemented via eigen-decomposition of a matrix estimated with a previously developed efficient nonparametric density estimation technique. We also propose a bootstrap-based diagnostic plot for estimating the dimension of the CS.
Results of simulations and real data applications demonstrate superior or competitive performance of CIM compared to that of some other SDR methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1752-1764 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1515080 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1515080 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1752-1764 Template-Type: ReDIF-Article 1.0 Author-Name: Francis K. C. Hui Author-X-Name-First: Francis K. C. Author-X-Name-Last: Hui Author-Name: C. You Author-X-Name-First: C. Author-X-Name-Last: You Author-Name: H. L. Shang Author-X-Name-First: H. L. Author-X-Name-Last: Shang Author-Name: Samuel Müller Author-X-Name-First: Samuel Author-X-Name-Last: Müller Title: Semiparametric Regression Using Variational Approximations Abstract: Semiparametric regression offers a flexible framework for modeling nonlinear relationships between a response and covariates. Prime examples are generalized additive models (GAMs), where splines (say) are used to approximate nonlinear functional components in conjunction with a quadratic penalty to control for overfitting. Estimation and inference are then generally performed based on the penalized likelihood, or under a mixed model framework. The penalized likelihood framework is fast but potentially unstable, and the smoothing parameters need to be chosen externally, using cross-validation, for instance. The mixed model framework tends to be more stable and offers a natural way of choosing the smoothing parameters, but for nonnormal responses it involves an intractable integral. In this article, we introduce a new framework for semiparametric regression based on variational approximations (VA). The approach possesses the stability and natural inference tools of the mixed model framework, while achieving computation times comparable to using penalized likelihood. Focusing on GAMs, we derive fully tractable variational likelihoods for some common response types. We present several features of the VA framework for inference, including a variational information matrix for inference on parametric components, and a closed-form update for estimating the smoothing parameter. We demonstrate the consistency of the VA estimates and establish an asymptotic normality result for the parametric component of the model. Simulation studies show the VA framework performs similarly to, and sometimes better than, currently available software for fitting GAMs. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1765-1777 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1518235 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1518235 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1765-1777 Template-Type: ReDIF-Article 1.0 Author-Name: Kin Yau Wong Author-X-Name-First: Kin Yau Author-X-Name-Last: Wong Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: D. Y. Lin Author-X-Name-First: D. Y.
Author-X-Name-Last: Lin Title: Robust Score Tests With Missing Data in Genomics Studies Abstract: Analysis of genomic data is often complicated by the presence of missing values, which may arise due to cost or other reasons. The prevailing approach of single imputation is generally invalid if the imputation model is misspecified. In this article, we propose a robust score statistic based on imputed data for testing the association between a phenotype and a genomic variable with (partially) missing values. We fit a semiparametric regression model for the genomic variable against an arbitrary function of the linear predictor in the phenotype model and impute each missing value by its estimated posterior expectation. We show that the score statistic with such imputed values is asymptotically unbiased under general missing-data mechanisms, even when the imputation model is misspecified. We develop a spline-based method to estimate the semiparametric imputation model and derive the asymptotic distribution of the corresponding score statistic with a consistent variance estimator using sieve approximation theory and empirical process theory. The proposed test is computationally feasible regardless of the number of independent variables in the imputation model. We demonstrate the advantages of the proposed method over existing methods through extensive simulation studies and provide an application to a major cancer genomics study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1778-1786 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1514304 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1514304 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1778-1786 Template-Type: ReDIF-Article 1.0 Author-Name: X. Jessie Jeng Author-X-Name-First: X. Jessie Author-X-Name-Last: Jeng Author-Name: Teng Zhang Author-X-Name-First: Teng Author-X-Name-Last: Zhang Author-Name: Jung-Ying Tzeng Author-X-Name-First: Jung-Ying Author-X-Name-Last: Tzeng Title: Efficient Signal Inclusion With Genomic Applications Abstract: This article addresses the challenge of efficiently capturing a high proportion of true signals for subsequent data analyses when sample sizes are relatively limited with respect to data dimension. We propose the signal missing rate (SMR) as a new measure for false-negative control to account for the variability of false-negative proportion. Novel data-adaptive procedures are developed to control SMR without incurring many unnecessary false positives under dependence. We justify the efficiency and adaptivity of the proposed methods via theory and simulation. The proposed methods are applied to GWAS on human height to effectively remove irrelevant single nucleotide polymorphisms (SNPs) while retaining a high proportion of relevant SNPs for subsequent polygenic analysis. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1787-1799 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1518236 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1518236 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1787-1799 Template-Type: ReDIF-Article 1.0 Author-Name: James M. Salter Author-X-Name-First: James M. 
Author-X-Name-Last: Salter Author-Name: Daniel B. Williamson Author-X-Name-First: Daniel B. Author-X-Name-Last: Williamson Author-Name: John Scinocca Author-X-Name-First: John Author-X-Name-Last: Scinocca Author-Name: Viatcheslav Kharin Author-X-Name-First: Viatcheslav Author-X-Name-Last: Kharin Title: Uncertainty Quantification for Computer Models With Spatial Output Using Calibration-Optimal Bases Abstract: The calibration of complex computer codes using uncertainty quantification (UQ) methods is a rich area of statistical methodological development. When applying these techniques to simulators with spatial output, it is now standard to use principal component decomposition to reduce the dimensions of the outputs in order to allow Gaussian process emulators to predict the output for calibration. We introduce the “terminal case,” in which the model cannot reproduce observations to within model discrepancy, and for which standard calibration methods in UQ fail to give sensible results. We show that even when there is no such issue with the model, the standard decomposition on the outputs can and usually does lead to a terminal case analysis. We present a simple test to allow a practitioner to establish whether their experiment will result in a terminal case analysis, and a methodology for defining calibration-optimal bases that avoid this whenever it is not inevitable. We present the optimal rotation algorithm for doing this, and demonstrate its efficacy for an idealized example for which the usual principal component methods fail. We apply these ideas to the CanAM4 model to demonstrate the terminal case issue arising for climate models. We discuss climate model tuning and the estimation of model discrepancy within this context, and show how the optimal rotation algorithm can be used in developing practical climate model tuning tools. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1800-1814 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1514306 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1514306 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1800-1814 Template-Type: ReDIF-Article 1.0 Author-Name: Gang Li Author-X-Name-First: Gang Author-X-Name-Last: Li Author-Name: Xiaoyan Wang Author-X-Name-First: Xiaoyan Author-X-Name-Last: Wang Title: Prediction Accuracy Measures for a Nonlinear Model and for Right-Censored Time-to-Event Data Abstract: This article develops a pair of new prediction summary measures for a nonlinear prediction function with right-censored time-to-event data. The first measure, defined as the proportion of explained variance by a linearly corrected prediction function, quantifies the potential predictive power of the nonlinear prediction function. The second measure, defined as the proportion of explained prediction error by its corrected prediction function, gauges the closeness of the prediction function to its corrected version and serves as a supplementary measure to indicate (by a value less than 1) whether the correction is needed to fulfill its potential predictive power and quantify how much prediction error reduction can be realized with the correction. The two measures together provide a complete summary of the predictive accuracy of the nonlinear prediction function. 
We motivate these measures by first establishing a variance decomposition and a prediction error decomposition at the population level and then deriving uncensored and censored sample versions of these decompositions. We note that for the least squares prediction function under the linear model with no censoring, the first measure reduces to the classical coefficient of determination and the second measure degenerates to 1. We show that the sample measures are consistent estimators of their population counterparts and conduct extensive simulations to investigate their finite sample properties. A real data illustration is provided using the PBC data. An R package PAmeasures has been developed and made available via the CRAN R library. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1815-1825 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1515079 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1515079 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1815-1825 Template-Type: ReDIF-Article 1.0 Author-Name: Stephane Shao Author-X-Name-First: Stephane Author-X-Name-Last: Shao Author-Name: Pierre E. Jacob Author-X-Name-First: Pierre E. Author-X-Name-Last: Jacob Author-Name: Jie Ding Author-X-Name-First: Jie Author-X-Name-Last: Ding Author-Name: Vahid Tarokh Author-X-Name-First: Vahid Author-X-Name-Last: Tarokh Title: Bayesian Model Comparison with the Hyvärinen Score: Computation and Consistency Abstract: The Bayes factor is a widely used criterion in model comparison, and its logarithm is a difference of out-of-sample predictive scores under the logarithmic scoring rule. However, when some of the candidate models involve vague priors on their parameters, the log-Bayes factor features an arbitrary additive constant that hinders its interpretation. As an alternative, we consider model comparison using the Hyvärinen score. We propose a method to consistently estimate this score for parametric models, using sequential Monte Carlo methods. We show that this score can be estimated for models with tractable likelihoods as well as nonlinear non-Gaussian state-space models with intractable likelihoods. We prove the asymptotic consistency of this new model selection criterion under strong regularity assumptions in the case of nonnested models, and we provide qualitative insights for the nested case. We also use existing characterizations of proper scoring rules on discrete spaces to extend the Hyvärinen score to discrete observations. Our numerical illustrations include Lévy-driven stochastic volatility models and diffusion models for population dynamics. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1826-1837 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1518237 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1518237 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1826-1837 Template-Type: ReDIF-Article 1.0 Author-Name: Annalisa Cadonna Author-X-Name-First: Annalisa Author-X-Name-Last: Cadonna Author-Name: Athanasios Kottas Author-X-Name-First: Athanasios Author-X-Name-Last: Kottas Author-Name: Raquel Prado Author-X-Name-First: Raquel Author-X-Name-Last: Prado Title: Bayesian Spectral Modeling for Multiple Time Series Abstract: We develop a novel Bayesian modeling approach to spectral density estimation for multiple time series. The log-periodogram distribution for each series is modeled as a mixture of Gaussian distributions with frequency-dependent weights and mean functions. The implied model for the log-spectral density is a mixture of linear mean functions with frequency-dependent weights. The mixture weights are built through successive differences of a logit-normal distribution function with frequency-dependent parameters. Building from the construction for a single spectral density, we develop a hierarchical extension for multiple time series. Specifically, we set the mean functions to be common to all spectral densities and make the weights specific to the time series through the parameters of the logit-normal distribution. In addition to accommodating flexible spectral density shapes, a practically important feature of the proposed formulation is that it allows for ready posterior simulation through a Gibbs sampler with closed form full conditional distributions for all model parameters. The modeling approach is illustrated with simulated datasets and used for spectral analysis of multichannel electroencephalographic recordings, which provides a key motivating application for the proposed methodology. Journal: Journal of the American Statistical Association Pages: 1838-1853 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1520114 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1520114 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1838-1853 Template-Type: ReDIF-Article 1.0 Author-Name: Fei Jiang Author-X-Name-First: Fei Author-X-Name-Last: Jiang Author-Name: Lu Tian Author-X-Name-First: Lu Author-X-Name-Last: Tian Author-Name: Haoda Fu Author-X-Name-First: Haoda Author-X-Name-Last: Fu Author-Name: Takahiro Hasegawa Author-X-Name-First: Takahiro Author-X-Name-Last: Hasegawa Author-Name: L. J. Wei Author-X-Name-First: L. J. Author-X-Name-Last: Wei Title: Robust Alternatives to ANCOVA for Estimating the Treatment Effect via a Randomized Comparative Study Abstract: In comparing two treatments via a randomized clinical trial, the analysis of covariance (ANCOVA) technique is often utilized to estimate an overall treatment effect. The ANCOVA is generally perceived as a more efficient procedure than its simple two sample estimation counterpart. Unfortunately, when the ANCOVA model is nonlinear, the resulting estimator is generally not consistent. Recently, various nonparametric alternatives to the ANCOVA, such as the augmentation methods, have been proposed to estimate the treatment effect by adjusting the covariates. However, the properties of these alternatives have not been studied in the presence of treatment allocation imbalance. In this article, we take a different approach to explore how to improve the precision of the naive two-sample estimate even when the observed distributions of baseline covariates between two groups are dissimilar. 
Specifically, we derive a bias-adjusted estimation procedure constructed from a conditional inference principle via relevant ancillary statistics from the observed covariates. This estimator is shown to be asymptotically equivalent to an augmentation estimator under the unconditional setting. We utilize the data from a clinical trial for evaluating a combination treatment of cardiovascular diseases to illustrate our findings. Journal: Journal of the American Statistical Association Pages: 1854-1864 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1527226 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1527226 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1854-1864 Template-Type: ReDIF-Article 1.0 Author-Name: Benjamin D. Youngman Author-X-Name-First: Benjamin D. Author-X-Name-Last: Youngman Title: Generalized Additive Models for Exceedances of High Thresholds With an Application to Return Level Estimation for U.S. Wind Gusts Abstract: Generalized additive model (GAM) forms offer a flexible approach to capturing marginal variation. Such forms are used here to represent distributional variation in extreme values and presented in terms of spatio-temporal variation, which is often evident in environmental processes. A two-stage procedure is proposed that identifies extreme values as exceedances of a high threshold, which is defined as a fixed quantile and estimated by quantile regression. Excesses of the threshold are modeled with the generalized Pareto distribution (GPD). GAM forms are adopted for the threshold and GPD parameters, and all parameters, in particular the smoothing parameters, are estimated directly by restricted maximum likelihood, which provides an objective and relatively fast method of inference. The GAM models are used to produce return level maps for extreme wind gust speeds over the United States, which show extreme quantiles of the distribution of annual maximum gust speeds. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1865-1879 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1529596 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1529596 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1865-1879 Template-Type: ReDIF-Article 1.0 Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Yuan Ke Author-X-Name-First: Yuan Author-X-Name-Last: Ke Author-Name: Qiang Sun Author-X-Name-First: Qiang Author-X-Name-Last: Sun Author-Name: Wen-Xin Zhou Author-X-Name-First: Wen-Xin Author-X-Name-Last: Zhou Title: FarmTest: Factor-Adjusted Robust Multiple Testing With Approximate False Discovery Control Abstract: Large-scale multiple testing with correlated and heavy-tailed data arises in a wide range of research areas, from genomics and medical imaging to finance. Conventional methods for estimating the false discovery proportion (FDP) often ignore the effect of heavy-tailedness and the dependence structure among test statistics, and thus may lead to inefficient or even inconsistent estimation. Also, the commonly imposed joint normality assumption is arguably too stringent for many applications.
To address these challenges, in this article we propose a factor-adjusted robust multiple testing (FarmTest) procedure for large-scale simultaneous inference with control of the FDP. We demonstrate that robust factor adjustments are extremely important in both controlling the FDP and improving the power. We identify general conditions under which the proposed method produces a consistent estimate of the FDP. As a byproduct that is of independent interest, we establish an exponential-type deviation inequality for a robust U-type covariance estimator under the spectral norm. Extensive numerical experiments demonstrate the advantage of the proposed method over several state-of-the-art methods, especially when the data are generated from heavy-tailed distributions. The proposed procedures are implemented in the R package FarmTest. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1880-1893 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1527700 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1527700 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1880-1893 Template-Type: ReDIF-Article 1.0 Author-Name: Will Wei Sun Author-X-Name-First: Will Wei Author-X-Name-Last: Sun Author-Name: Lexin Li Author-X-Name-First: Lexin Author-X-Name-Last: Li Title: Dynamic Tensor Clustering Abstract: Dynamic tensor data are becoming prevalent in numerous applications. Existing tensor clustering methods either fail to account for the dynamic nature of the data or are inapplicable to a general-order tensor. There is also a gap between statistical guarantee and computational efficiency for existing tensor clustering solutions. In this article, we propose a new dynamic tensor clustering method that works for a general-order dynamic tensor and enjoys both strong statistical guarantee and high computational efficiency. Our proposal is based on a new structured tensor factorization that encourages both sparsity and smoothness in parameters along the specified tensor modes. Computationally, we develop a highly efficient optimization algorithm that benefits from substantial dimension reduction. Theoretically, we first establish a nonasymptotic error bound for the estimator from the structured tensor factorization. Building upon this error bound, we then derive the rate of convergence of the estimated cluster centers, and show that the estimated clusters recover the true cluster structures with high probability. Moreover, our proposed method can be naturally extended to co-clustering of multiple modes of the tensor data. The efficacy of our method is illustrated through simulations and a brain dynamic functional connectivity analysis from an autism spectrum disorder study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1894-1907 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1527701 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1527701 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1894-1907 Template-Type: ReDIF-Article 1.0 Author-Name: Zhao Ren Author-X-Name-First: Zhao Author-X-Name-Last: Ren Author-Name: Yongjian Kang Author-X-Name-First: Yongjian Author-X-Name-Last: Kang Author-Name: Yingying Fan Author-X-Name-First: Yingying Author-X-Name-Last: Fan Author-Name: Jinchi Lv Author-X-Name-First: Jinchi Author-X-Name-Last: Lv Title: Tuning-Free Heterogeneous Inference in Massive Networks Abstract: Heterogeneity is often natural in many contemporary applications involving massive data. While posing new challenges to effective learning, it can play a crucial role in powering meaningful scientific discoveries through the integration of information among subpopulations of interest. In this article, we exploit multiple networks with Gaussian graphs to encode the connectivity patterns of a large number of features on the subpopulations. To uncover the underlying sparsity structures across subpopulations, we suggest a framework of large-scale tuning-free heterogeneous inference, where the number of networks is allowed to diverge. In particular, two new tests, the chi-based and the linear functional-based tests, are introduced and their asymptotic null distributions are established. Under mild regularity conditions, we establish that both tests are optimal in achieving the testable region boundary and the sample size requirement for the latter test is minimal. Both theoretical guarantees and the tuning-free property stem from efficient multiple-network estimation by our newly suggested heterogeneous group square-root Lasso for high-dimensional multi-response regression with heterogeneous noises. To solve this convex program, we further introduce a scalable algorithm that enjoys provable convergence to the global optimum. Both computational and theoretical advantages are elucidated through simulation and real data examples. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1908-1925 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2018.1537920 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537920 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1908-1925 Template-Type: ReDIF-Article 1.0 Author-Name: Neil Pearce Author-X-Name-First: Neil Author-X-Name-Last: Pearce Title: Handbook of Statistical Methods for Case-Control Studies. Journal: Journal of the American Statistical Association Pages: 1926-1928 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2019.1691865 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1691865 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1926-1928 Template-Type: ReDIF-Article 1.0 Author-Name: A. Alexandre Trindade Author-X-Name-First: A. Alexandre Author-X-Name-Last: Trindade Title: Linear Models and the Relevant Distributions and Matrix Algebra. Journal: Journal of the American Statistical Association Pages: 1928-1929 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2019.1691864 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1691864 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1928-1929 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Editorial Collaborators Journal: Journal of the American Statistical Association Pages: W1930-W1938 Issue: 528 Volume: 114 Year: 2019 Month: 10 X-DOI: 10.1080/01621459.2019.1690842 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1690842 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:W1930-W1938 Template-Type: ReDIF-Article 1.0 Author-Name: Qinshu Lian Author-X-Name-First: Qinshu Author-X-Name-Last: Lian Author-Name: James S. Hodges Author-X-Name-First: James S. Author-X-Name-Last: Hodges Author-Name: Haitao Chu Author-X-Name-First: Haitao Author-X-Name-Last: Chu Title: A Bayesian Hierarchical Summary Receiver Operating Characteristic Model for Network Meta-Analysis of Diagnostic Tests Abstract: In studies evaluating the accuracy of diagnostic tests, three designs are commonly used: crossover, randomized, and noncomparative. Existing methods for meta-analysis of diagnostic tests mainly consider the simple cases in which the reference test in all or none of the studies can be considered a gold standard test, and in which all studies use either a randomized or noncomparative design. The proliferation of diagnostic instruments and the diversity of study designs create a need for more general methods to combine studies that include or do not include a gold standard test and that use various designs. This article extends the Bayesian hierarchical summary receiver operating characteristic model to network meta-analysis of diagnostic tests to simultaneously compare multiple tests within a missing data framework. The method accounts for correlations between multiple tests and for heterogeneity between studies. It also allows different studies to include different subsets of diagnostic tests and provides flexibility in the choice of summary statistics. The model is evaluated using simulations and illustrated using real data on tests for deep vein thrombosis, with sensitivity analyses. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 949-961 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1476239 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1476239 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:949-961 Template-Type: ReDIF-Article 1.0 Author-Name: Steffen Ventz Author-X-Name-First: Steffen Author-X-Name-Last: Ventz Author-Name: Matteo Cellamare Author-X-Name-First: Matteo Author-X-Name-Last: Cellamare Author-Name: Sergio Bacallado Author-X-Name-First: Sergio Author-X-Name-Last: Bacallado Author-Name: Lorenzo Trippa Author-X-Name-First: Lorenzo Author-X-Name-Last: Trippa Title: Bayesian Uncertainty Directed Trial Designs Abstract: Most Bayesian response-adaptive designs unbalance randomization rates toward the most promising arms with the goal of increasing the number of positive treatment outcomes during the study, even though the primary aim of the trial is different. We discuss Bayesian uncertainty directed designs (BUD), a class of Bayesian designs in which the investigator specifies an information measure tailored to the experiment. All decisions during the trial are selected to optimize the available information at the end of the study.
The approach can be applied to several designs, ranging from early-stage multi-arm trials to biomarker-driven and multi-endpoint studies. We discuss the asymptotic limit of the patient allocation proportion to treatments, and illustrate the finite-sample operating characteristics of BUD designs through examples, including multi-arm trials, biomarker-stratified trials, and trials with multiple co-primary endpoints. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 962-974 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1497497 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1497497 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:962-974 Template-Type: ReDIF-Article 1.0 Author-Name: Zhonghua Liu Author-X-Name-First: Zhonghua Author-X-Name-Last: Liu Author-Name: Xihong Lin Author-X-Name-First: Xihong Author-X-Name-Last: Lin Title: A Geometric Perspective on the Power of Principal Component Association Tests in Multiple Phenotype Studies Abstract: Joint analysis of multiple phenotypes can increase statistical power in genetic association studies. Principal component analysis (PCA), a popular dimension reduction method, has been proposed to analyze multiple correlated phenotypes, especially when the number of phenotypes is high dimensional. It has been empirically observed that the first PC, which summarizes the largest amount of variance, can be less powerful than higher-order PCs and other commonly used methods in detecting genetic association signals. In this article, we investigate the properties of PCA-based multiple phenotype analysis from a geometric perspective by introducing a novel concept called principal angle. A particular PC is powerful if its principal angle is 0° and is powerless if its principal angle is 90°. Without prior knowledge about the true principal angle, each PC can be powerless. We propose linear, nonlinear, and data-adaptive omnibus tests by combining PCs. We demonstrate that the Wald test is a special quadratic PC-based test. We show that the omnibus PC test is robust and powerful in a wide range of scenarios. We study the properties of the proposed methods using power analysis and eigen-analysis. The subtle differences and close connections between these combined PC methods are illustrated graphically in terms of their rejection boundaries. Our proposed tests have convex acceptance regions and hence are admissible. The p-values for the proposed tests can be efficiently calculated analytically and the proposed tests have been implemented in the publicly available R package MPAT. We conduct simulation studies in both low- and high-dimensional settings with various signal vectors and correlation structures. We apply the proposed tests to the joint analysis of metabolic syndrome-related phenotypes with datasets collected from four international consortia to demonstrate the effectiveness of the proposed combined PC testing procedures. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association Pages: 975-990 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1513363 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1513363 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:975-990 Template-Type: ReDIF-Article 1.0 Author-Name: Qian Li Author-X-Name-First: Qian Author-X-Name-Last: Li Author-Name: Damla Şentürk Author-X-Name-First: Damla Author-X-Name-Last: Şentürk Author-Name: Catherine A. Sugar Author-X-Name-First: Catherine A. Author-X-Name-Last: Sugar Author-Name: Shafali Jeste Author-X-Name-First: Shafali Author-X-Name-Last: Jeste Author-Name: Charlotte DiStefano Author-X-Name-First: Charlotte Author-X-Name-Last: DiStefano Author-Name: Joel Frohlich Author-X-Name-First: Joel Author-X-Name-Last: Frohlich Author-Name: Donatello Telesca Author-X-Name-First: Donatello Author-X-Name-Last: Telesca Title: Inferring Brain Signals Synchronicity From a Sample of EEG Readings Abstract: Inferring patterns of synchronous brain activity from a heterogeneous sample of electroencephalograms is scientifically and methodologically challenging. While it is intuitively and statistically appealing to rely on readings from more than one individual in order to highlight recurrent patterns of brain activation, pooling information across subjects presents nontrivial methodological problems. We discuss some of the scientific issues associated with the understanding of synchronized neuronal activity and propose a methodological framework for statistical inference from a sample of EEG readings. Our work builds on classical contributions in time-series, clustering, and functional data analysis, in an effort to reframe a challenging inferential problem in the context of familiar analytical techniques. Some attention is paid to computational issues, with a proposal based on the combination of machine learning and Bayesian techniques. Code submitted with this article was checked by an Associate Editor for Reproducibility and is available as an online supplement. Journal: Journal of the American Statistical Association Pages: 991-1001 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1518233 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1518233 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:991-1001 Template-Type: ReDIF-Article 1.0 Author-Name: Justin Strait Author-X-Name-First: Justin Author-X-Name-Last: Strait Author-Name: Oksana Chkrebtii Author-X-Name-First: Oksana Author-X-Name-Last: Chkrebtii Author-Name: Sebastian Kurtek Author-X-Name-First: Sebastian Author-X-Name-Last: Kurtek Title: Automatic Detection and Uncertainty Quantification of Landmarks on Elastic Curves Abstract: A population quantity of interest in statistical shape analysis is the location of landmarks, which are points that aid in reconstructing and representing shapes of objects. We provide an automated, model-based approach to inferring landmarks given a sample of shape data. The model is formulated based on a linear reconstruction of the shape, passing through the specified points, and a Bayesian inferential approach is described for estimating unknown landmark locations. 
The question of how many landmarks to select is addressed in two different ways: (1) by defining a criterion-based approach and (2) by jointly estimating the number of landmarks along with their locations. Efficient methods for posterior sampling are also discussed. We motivate our approach using several simulated examples, as well as data obtained from applications in computer vision, biology, and medical imaging. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1002-1017 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1527224 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1527224 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1002-1017 Template-Type: ReDIF-Article 1.0 Author-Name: Yang Chen Author-X-Name-First: Yang Author-X-Name-Last: Chen Author-Name: Xiao-Li Meng Author-X-Name-First: Xiao-Li Author-X-Name-Last: Meng Author-Name: Xufei Wang Author-X-Name-First: Xufei Author-X-Name-Last: Wang Author-Name: David A. van Dyk Author-X-Name-First: David A. Author-X-Name-Last: van Dyk Author-Name: Herman L. Marshall Author-X-Name-First: Herman L. Author-X-Name-Last: Marshall Author-Name: Vinay L. Kashyap Author-X-Name-First: Vinay L. Author-X-Name-Last: Kashyap Title: Calibration Concordance for Astronomical Instruments via Multiplicative Shrinkage Abstract: Calibration data are often obtained by observing several well-understood objects simultaneously with multiple instruments, such as satellites for measuring astronomical sources. Analyzing such data and obtaining proper concordance among the instruments is challenging when the physical source models are not well understood, when there are uncertainties in “known” physical quantities, or when data quality varies in ways that cannot be fully quantified. Furthermore, the number of model parameters increases with both the number of instruments and the number of sources. Thus, concordance of the instruments requires careful modeling of the mean signals, the intrinsic source differences, and measurement errors. In this article, we propose a log-Normal model and a more general log-t model that respect the multiplicative nature of the mean signals via a half-variance adjustment, yet permit imperfections in the mean modeling to be absorbed by residual variances. We present analytical solutions in the form of power shrinkage in special cases and develop reliable Markov chain Monte Carlo algorithms for general cases, both of which are available in the Python module CalConcordance. We apply our method to several datasets, including a combination of observations of active galactic nuclei (AGN) and spectral line emission from the supernova remnant E0102, obtained with a variety of X-ray telescopes such as Chandra, XMM-Newton, Suzaku, and Swift. The data are compiled by the International Astronomical Consortium for High Energy Calibration. We demonstrate that our method provides helpful and practical guidance for astrophysicists when adjusting for disagreements among instruments. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association Pages: 1018-1037 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1528978 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1528978 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1018-1037 Template-Type: ReDIF-Article 1.0 Author-Name: David Benkeser Author-X-Name-First: David Author-X-Name-Last: Benkeser Author-Name: Peter B. Gilbert Author-X-Name-First: Peter B. Author-X-Name-Last: Gilbert Author-Name: Marco Carone Author-X-Name-First: Marco Author-X-Name-Last: Carone Title: Estimating and Testing Vaccine Sieve Effects Using Machine Learning Abstract: When available, vaccines are an effective means of disease prevention. Unfortunately, efficacious vaccines have not yet been developed for several major infectious diseases, including HIV and malaria. Vaccine sieve analysis studies whether and how the efficacy of a vaccine varies with the genetics of the pathogen of interest, which can guide subsequent vaccine development and deployment. In sieve analyses, the effect of the vaccine on the cumulative incidence corresponding to each of several possible genotypes is often assessed within a competing risks framework. In the context of clinical trials, the estimators employed in these analyses generally do not account for covariates, even though the latter may be predictive of the study endpoint or censoring. Motivated by two recent preventive vaccine efficacy trials for HIV and malaria, we develop new methodology for vaccine sieve analysis. Our approach offers improved validity and efficiency relative to existing approaches by allowing covariate adjustment through ensemble machine learning. We derive results that indicate how to perform statistical inference using our estimators. Our analysis of the HIV and malaria trials shows markedly increased precision—up to doubled efficiency in both trials—under more plausible assumptions compared with standard methodology. Our findings provide greater evidence for vaccine sieve effects in both trials. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1038-1049 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1529594 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1529594 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1038-1049 Template-Type: ReDIF-Article 1.0 Author-Name: Furong Li Author-X-Name-First: Furong Author-X-Name-Last: Li Author-Name: Huiyan Sang Author-X-Name-First: Huiyan Author-X-Name-Last: Sang Title: Spatial Homogeneity Pursuit of Regression Coefficients for Large Datasets Abstract: Spatial regression models have been widely used to describe the relationship between a response variable and some explanatory variables over a region of interest, taking into account the spatial dependence of the observations. In many applications, relationships between response variables and covariates are expected to exhibit complex spatial patterns. We propose a new approach, referred to as spatially clustered coefficient (SCC) regression, to detect spatially clustered patterns in the regression coefficients. 
It incorporates spatial neighborhood information through a carefully constructed regularization to automatically detect change points in space and to achieve computational scalability. Our numerical studies suggest that SCC works very effectively, capturing not only clustered coefficients, but also smoothly varying coefficients because of its strong local adaptivity. This flexibility allows researchers to explore various spatial structures in regression coefficients. We also establish theoretical properties of SCC. We use SCC to explore the relationship between the temperature and salinity of sea water in the Atlantic basin; this can provide important insights about the evolution of individual water masses and the pathway and strength of meridional overturning circulation in oceanography. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1050-1062 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1529595 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1529595 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1050-1062 Template-Type: ReDIF-Article 1.0 Author-Name: Samuel I. Berchuck Author-X-Name-First: Samuel I. Author-X-Name-Last: Berchuck Author-Name: Jean-Claude Mwanza Author-X-Name-First: Jean-Claude Author-X-Name-Last: Mwanza Author-Name: Joshua L. Warren Author-X-Name-First: Joshua L. Author-X-Name-Last: Warren Title: Diagnosing Glaucoma Progression With Visual Field Data Using a Spatiotemporal Boundary Detection Method Abstract: Diagnosing glaucoma progression is critical for limiting irreversible vision loss. A common method for assessing glaucoma progression uses a longitudinal series of visual fields (VFs) acquired at regular intervals. VF data are characterized by a complex spatiotemporal structure due to the data generating process and ocular anatomy. Thus, advanced statistical methods are needed to make clinical determinations regarding progression status. We introduce a spatiotemporal boundary detection model that allows the underlying anatomy of the optic disc to dictate the spatial structure of the VF data across time. We show that our new method provides novel insight into vision loss that improves diagnosis of glaucoma progression using data from the Vein Pulsation Study Trial in Glaucoma and the Lions Eye Institute trial registry. Simulations are presented, showing the proposed methodology is preferred over existing spatial methods for VF data. Supplementary materials for this article are available online and the method is implemented in the R package womblR. Journal: Journal of the American Statistical Association Pages: 1063-1074 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1537911 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537911 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1063-1074 Template-Type: ReDIF-Article 1.0 Author-Name: Jesson J. Einmahl Author-X-Name-First: Jesson J. Author-X-Name-Last: Einmahl Author-Name: John H. J. Einmahl Author-X-Name-First: John H. J. 
Author-X-Name-Last: Einmahl Author-Name: Laurens de Haan Author-X-Name-First: Laurens Author-X-Name-Last: de Haan Title: Limits to Human Life Span Through Extreme Value Theory Abstract: There is no scientific consensus on the fundamental question whether the probability distribution of the human life span has a finite endpoint or not and, if so, whether this upper limit changes over time. Our study uses a unique dataset of the ages at death—in days—of all (about 285,000) Dutch residents, born in the Netherlands, who died in the years 1986–2015 at a minimum age of 92 years and is based on extreme value theory, the coherent approach to research problems of this type. Unlike some other studies, we base our analysis on the configuration of thousands of mortality data of old people, not just the few oldest old. We find compelling statistical evidence that there is indeed an upper limit to the life span of men and to that of women for all the 30 years we consider and, moreover, that there are no indications of trends in these upper limits over the last 30 years, despite the fact that the number of people reaching high age (say 95 years) almost tripled. We also present estimates for the endpoints, for the force of mortality at very high age, and for the so-called perseverance parameter. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1075-1080 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1537912 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537912 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1075-1080 Template-Type: ReDIF-Article 1.0 Author-Name: Shahin Tavakoli Author-X-Name-First: Shahin Author-X-Name-Last: Tavakoli Author-Name: Davide Pigoli Author-X-Name-First: Davide Author-X-Name-Last: Pigoli Author-Name: John A. D. Aston Author-X-Name-First: John A. D. Author-X-Name-Last: Aston Author-Name: John S. Coleman Author-X-Name-First: John S. Author-X-Name-Last: Coleman Title: A Spatial Modeling Approach for Linguistic Object Data: Analyzing Dialect Sound Variations Across Great Britain Abstract: Dialect variation is of considerable interest in linguistics and other social sciences. However, traditionally it has been studied using proxies (transcriptions) rather than acoustic recordings directly. We introduce novel statistical techniques to analyze geolocalized speech recordings and to explore the spatial variation of pronunciations continuously over the region of interest, as opposed to traditional isoglosses, which provide a discrete partition of the region. Data of this type require an explicit modeling of the variation in the mean and the covariance. Usual Euclidean metrics are not appropriate, and we therefore introduce the concept of d-covariance, which allows consistent estimation both in space and at individual locations. We then propose spatial smoothing for these objects, which accounts for the possibly nonconvex geometry of the domain of interest. We apply the proposed method to data from the spoken part of the British National Corpus, deposited at the British Library, London, and we produce maps of the dialect variation over Great Britain.
In addition, the methods allow for acoustic reconstruction across the domain of interest, allowing researchers to listen to the statistical analysis. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1081-1096 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2019.1607357 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1607357 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1081-1096 Template-Type: ReDIF-Article 1.0 Author-Name: Ian L. Dryden Author-X-Name-First: Ian L. Author-X-Name-Last: Dryden Author-Name: Simon P. Preston Author-X-Name-First: Simon P. Author-X-Name-Last: Preston Author-Name: Katie E. Severn Author-X-Name-First: Katie E. Author-X-Name-Last: Severn Title: Discussion: Object-Oriented Data Analysis, Power Metrics, and Graph Laplacians Journal: Journal of the American Statistical Association Pages: 1097-1098 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2019.1635477 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1635477 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1097-1098 Template-Type: ReDIF-Article 1.0 Author-Name: Alexander Petersen Author-X-Name-First: Alexander Author-X-Name-Last: Petersen Author-Name: Hans-Georg Müller Author-X-Name-First: Hans-Georg Author-X-Name-Last: Müller Title: Discussion: A Spatial Modeling Approach for Linguistic Object Data: Analyzing Dialect Sound Variations Across Great Britain, by Shahin Tavakoli et al. Journal: Journal of the American Statistical Association Pages: 1099-1101 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2019.1635478 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1635478 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1099-1101 Template-Type: ReDIF-Article 1.0 Author-Name: J. S. Marron Author-X-Name-First: J. S. Author-X-Name-Last: Marron Title: Discussion: A Spatial Modeling Approach for Linguistic Object Data: Analysing Dialect Sound Variations Across Great Britain, by Shahin Tavakoli et al. Journal: Journal of the American Statistical Association Pages: 1102-1102 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2019.1639513 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1639513 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1102-1102 Template-Type: ReDIF-Article 1.0 Author-Name: Shahin Tavakoli Author-X-Name-First: Shahin Author-X-Name-Last: Tavakoli Author-Name: Davide Pigoli Author-X-Name-First: Davide Author-X-Name-Last: Pigoli Author-Name: John A. D. Aston Author-X-Name-First: John A. D. Author-X-Name-Last: Aston Author-Name: John S. Coleman Author-X-Name-First: John S. Author-X-Name-Last: Coleman Title: Rejoinder for “A Spatial Modeling Approach for Linguistic Object Data: Analyzing Dialect Sound Variations Across Great Britain” Journal: Journal of the American Statistical Association Pages: 1103-1104 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2019.1655931 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1655931 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1103-1104 Template-Type: ReDIF-Article 1.0 Author-Name: Patrick Rubin-Delanchy Author-X-Name-First: Patrick Author-X-Name-Last: Rubin-Delanchy Author-Name: Nicholas A. Heard Author-X-Name-First: Nicholas A. Author-X-Name-Last: Heard Author-Name: Daniel J. Lawson Author-X-Name-First: Daniel J. Author-X-Name-Last: Lawson Title: Meta-Analysis of Mid-p-Values: Some New Results Based on the Convex Order Abstract: The mid-p-value is a proposed improvement on the ordinary p-value for the case where the test statistic is partially or completely discrete. In this case, the ordinary p-value is conservative, meaning that its null distribution is larger than a uniform distribution on the unit interval, in the usual stochastic order. The mid-p-value is not conservative. However, its null distribution is dominated by the uniform distribution in a different stochastic order, called the convex order. This property leads us to discover some new finite-sample and asymptotic bounds on functions of mid-p-values, which can be used to combine results from different hypothesis tests conservatively, yet more powerfully, using mid-p-values rather than p-values. Our methodology is demonstrated on real data from a cyber-security application. Journal: Journal of the American Statistical Association Pages: 1105-1112 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1469994 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1469994 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1105-1112 Template-Type: ReDIF-Article 1.0 Author-Name: Jeffrey W. Miller Author-X-Name-First: Jeffrey W. Author-X-Name-Last: Miller Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: Robust Bayesian Inference via Coarsening Abstract: The standard approach to Bayesian inference is based on the assumption that the distribution of the data belongs to the chosen model class. However, even a small violation of this assumption can have a large impact on the outcome of a Bayesian procedure. We introduce a novel approach to Bayesian inference that improves robustness to small departures from the model: rather than conditioning on the event that the observed data are generated by the model, one conditions on the event that the model generates data close to the observed data, in a distributional sense. When closeness is defined in terms of relative entropy, the resulting “coarsened” posterior can be approximated by simply tempering the likelihood—that is, by raising the likelihood to a fractional power—thus, inference can usually be implemented via standard algorithms, and one can even obtain analytical solutions when using conjugate priors. Some theoretical properties are derived, and we illustrate the approach with real and simulated data using mixture models and autoregressive models of unknown order. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1113-1125 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1469995 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1469995 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1113-1125 Template-Type: ReDIF-Article 1.0 Author-Name: Mickaël De Backer Author-X-Name-First: Mickaël Author-X-Name-Last: De Backer Author-Name: Anouar El Ghouch Author-X-Name-First: Anouar Author-X-Name-Last: El Ghouch Author-Name: Ingrid Van Keilegom Author-X-Name-First: Ingrid Author-X-Name-Last: Van Keilegom Title: An Adapted Loss Function for Censored Quantile Regression Abstract: In this article, we study a novel approach for the estimation of quantiles when facing potential right censoring of the responses. Contrary to the existing literature on the subject, the adopted strategy of this article is to tackle censoring at the very level of the loss function usually employed for the computation of quantiles, the so-called “check” function. For interpretation purposes, a simple comparison with the latter reveals how censoring is accounted for in the newly proposed loss function. Subsequently, when considering the inclusion of covariates for conditional quantile estimation, by defining a new general loss function the proposed methodology opens the gate to numerous parametric, semiparametric, and nonparametric modeling techniques. To illustrate this statement, we consider the well-studied linear regression under the usual assumption of conditional independence between the true response and the censoring variable. For practical minimization of the studied loss function, we also provide a simple algorithmic procedure shown to yield satisfactory results for the proposed estimator with respect to the existing literature in an extensive simulation study. From a more theoretical perspective, consistency and asymptotic normality of the estimator for linear regression are obtained using several recent results on nonsmooth semiparametric estimation equations with an infinite-dimensional nuisance parameter, while numerical examples illustrate the adequacy of a simple bootstrap procedure for inferential purposes. Lastly, an application to a real dataset is used to further illustrate the validity and finite sample performance of the proposed estimator. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1126-1137 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1469996 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1469996 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1126-1137 Template-Type: ReDIF-Article 1.0 Author-Name: Xinbing Kong Author-X-Name-First: Xinbing Author-X-Name-Last: Kong Author-Name: Jiangyan Wang Author-X-Name-First: Jiangyan Author-X-Name-Last: Wang Author-Name: Jinbao Xing Author-X-Name-First: Jinbao Author-X-Name-Last: Xing Author-Name: Chao Xu Author-X-Name-First: Chao Author-X-Name-Last: Xu Author-Name: Chao Ying Author-X-Name-First: Chao Author-X-Name-Last: Ying Title: Factor and Idiosyncratic Empirical Processes Abstract: The distributions of the common and idiosyncratic components for an individual variable are important in forecasting and applications. However, they are not identified with low-dimensional observations. Using the recently developed theory for large dimensional approximate factor models for large panel data, the common and idiosyncratic components can be estimated consistently.
Based on the estimated common and idiosyncratic components, we construct the empirical processes for estimation of the distribution functions of the common and idiosyncratic components. We prove that the two empirical processes are oracle efficient when T = o(p), where p and T are the dimension and sample size, respectively. This demonstrates that the factor and idiosyncratic empirical processes behave as well as the empirical processes pretending that the common and idiosyncratic components for an individual variable are directly observable. Based on this oracle property, we construct simultaneous confidence bands (SCBs) for the distributions of the common and idiosyncratic components. For the first-order consistency of the estimated distribution functions, $\sqrt{T}=o(p)$ suffices. Extensive simulation studies confirm that the estimated bands have good coverage frequencies. Our real data analysis shows that the common-component distribution has a structural change during the crisis in 2008, while the idiosyncratic-component distribution does not change much. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1138-1146 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1469997 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1469997 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1138-1146 Template-Type: ReDIF-Article 1.0 Author-Name: Yixin Wang Author-X-Name-First: Yixin Author-X-Name-Last: Wang Author-Name: David M. Blei Author-X-Name-First: David M. Author-X-Name-Last: Blei Title: Frequentist Consistency of Variational Bayes Abstract: A key challenge for modern Bayesian statistics is how to perform scalable inference of posterior distributions. To address this challenge, variational Bayes (VB) methods have emerged as a popular alternative to the classical Markov chain Monte Carlo (MCMC) methods. VB methods tend to be faster while achieving comparable predictive performance. However, there are few theoretical results around VB. In this article, we establish frequentist consistency and asymptotic normality of VB methods. Specifically, we connect VB methods to point estimates based on variational approximations, called frequentist variational approximations, and we use the connection to prove a variational Bernstein–von Mises theorem. The theorem leverages the theoretical characterizations of frequentist variational approximations to understand asymptotic properties of VB. In summary, we prove that (1) the VB posterior converges to the Kullback–Leibler (KL) minimizer of a normal distribution, centered at the truth, and (2) the corresponding variational expectation of the parameter is consistent and asymptotically normal. As applications of the theorem, we derive asymptotic properties of VB posteriors in Bayesian mixture models, Bayesian generalized linear mixed models, and Bayesian stochastic block models. We conduct a simulation study to illustrate these theoretical results. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1147-1161 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1473776 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1473776 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1147-1161 Template-Type: ReDIF-Article 1.0 Author-Name: Ery Arias-Castro Author-X-Name-First: Ery Author-X-Name-Last: Arias-Castro Author-Name: Beatriz Pateiro-López Author-X-Name-First: Beatriz Author-X-Name-Last: Pateiro-López Author-Name: Alberto Rodríguez-Casal Author-X-Name-First: Alberto Author-X-Name-Last: Rodríguez-Casal Title: Minimax Estimation of the Volume of a Set Under the Rolling Ball Condition Abstract: We consider the problem of estimating the volume of a compact domain in a Euclidean space based on a uniform sample from the domain. We assume that the domain has a boundary with positive reach. We propose a data-splitting approach to correct the bias of the plug-in estimator based on the sample α-convex hull. We show that this simple estimator achieves a minimax lower bound that we derive. Some numerical experiments corroborate our theoretical findings. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1162-1173 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1482751 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1482751 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1162-1173 Template-Type: ReDIF-Article 1.0 Author-Name: Marco Carone Author-X-Name-First: Marco Author-X-Name-Last: Carone Author-Name: Alexander R. Luedtke Author-X-Name-First: Alexander R. Author-X-Name-Last: Luedtke Author-Name: Mark J. van der Laan Author-X-Name-First: Mark J. Author-X-Name-Last: van der Laan Title: Toward Computerized Efficient Estimation in Infinite-Dimensional Models Abstract: Despite the risk of misspecification they are tied to, parametric models continue to be used in statistical practice because they are simple and convenient to use. In particular, efficient estimation procedures in parametric models are easy to describe and implement. Unfortunately, the same cannot be said of semiparametric and nonparametric models. While the latter often reflect the level of available scientific knowledge more appropriately, performing efficient inference in these models is generally challenging. The efficient influence function is a key analytic object from which the construction of asymptotically efficient estimators can potentially be streamlined. However, the theoretical derivation of the efficient influence function requires specialized knowledge and is often a difficult task, even for experts. In this article, we present a novel representation of the efficient influence function and describe a numerical procedure for approximating its evaluation. The approach generalizes the nonparametric procedures of Frangakis et al. and Luedtke, Carone, and van der Laan to arbitrary models. We present theoretical results to support our proposal and illustrate the method in the context of several semiparametric problems. The proposed approach is an important step toward automating efficient estimation in general statistical models, thereby rendering more accessible the use of realistic models in statistical analyses. Supplementary materials for this article are available online. 
Journal: Journal of the American Statistical Association Pages: 1174-1190 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1482752 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1482752 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1174-1190 Template-Type: ReDIF-Article 1.0 Author-Name: Lixia Hu Author-X-Name-First: Lixia Author-X-Name-Last: Hu Author-Name: Tao Huang Author-X-Name-First: Tao Author-X-Name-Last: Huang Author-Name: Jinhong You Author-X-Name-First: Jinhong Author-X-Name-Last: You Title: Estimation and Identification of a Varying-Coefficient Additive Model for Locally Stationary Processes Abstract: The additive model and the varying-coefficient model are both powerful regression tools, with wide practical applications. However, our empirical study on financial data has shown that both of these models have drawbacks when applied to locally stationary time series. For the analysis of functional data, Zhang and Wang have proposed a flexible regression method, called the varying-coefficient additive model (VCAM), and presented a two-step spline estimation method. Motivated by their approach, we adopt the VCAM to characterize the time-varying regression function in a locally stationary context. We propose a three-step spline estimation method and show its consistency and asymptotic normality. For the purpose of model diagnosis, we suggest an L2-distance test statistic to check the multiplicative assumption, and propose a two-stage penalty procedure to identify the additive terms and the varying-coefficient terms, provided that the VCAM is applicable. We also present the asymptotic distribution of the proposed test statistics and demonstrate the consistency of the two-stage model identification procedure. Simulation studies investigating the finite-sample performance of the estimation and model diagnosis methods confirm the validity of our asymptotic theory. The financial data are also analyzed. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1191-1204 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1482753 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1482753 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1191-1204 Template-Type: ReDIF-Article 1.0 Author-Name: Naveen N. Narisetty Author-X-Name-First: Naveen N. Author-X-Name-Last: Narisetty Author-Name: Juan Shen Author-X-Name-First: Juan Author-X-Name-Last: Shen Author-Name: Xuming He Author-X-Name-First: Xuming Author-X-Name-Last: He Title: Skinny Gibbs: A Consistent and Scalable Gibbs Sampler for Model Selection Abstract: We consider the computational and statistical issues for high-dimensional Bayesian model selection under the Gaussian spike and slab priors. To avoid large matrix computations needed in a standard Gibbs sampler, we propose a novel Gibbs sampler called “Skinny Gibbs” which is much more scalable to high-dimensional problems, both in memory and in computational efficiency. In particular, its computational complexity grows only linearly in p, the number of predictors, while retaining the property of strong model selection consistency even when p is much greater than the sample size n.
The present article focuses on logistic regression due to its broad applicability as a representative member of the generalized linear models. We compare our proposed method with several leading variable selection methods through a simulation study to show that Skinny Gibbs has a strong performance as indicated by our theoretical work. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1205-1217 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1482754 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1482754 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1205-1217 Template-Type: ReDIF-Article 1.0 Author-Name: Lingrui Gan Author-X-Name-First: Lingrui Author-X-Name-Last: Gan Author-Name: Naveen N. Narisetty Author-X-Name-First: Naveen N. Author-X-Name-Last: Narisetty Author-Name: Feng Liang Author-X-Name-First: Feng Author-X-Name-Last: Liang Title: Bayesian Regularization for Graphical Models With Unequal Shrinkage Abstract: We consider a Bayesian framework for estimating a high-dimensional sparse precision matrix, in which adaptive shrinkage and sparsity are induced by a mixture of Laplace priors. Besides discussing our formulation from the Bayesian standpoint, we investigate the MAP (maximum a posteriori) estimator from a penalized likelihood perspective that gives rise to a new nonconvex penalty approximating the ℓ0 penalty. Optimal error rates for estimation consistency in terms of various matrix norms along with selection consistency for sparse structure recovery are shown for the unique MAP estimator under mild conditions. For fast and efficient computation, an EM algorithm is proposed to compute the MAP estimator of the precision matrix and (approximate) posterior probabilities on the edges of the underlying sparse structure. Through extensive simulation studies and a real application to call center data, we have demonstrated the fine performance of our method compared with existing alternatives. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1218-1231 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1482755 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1482755 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1218-1231 Template-Type: ReDIF-Article 1.0 Author-Name: Fei Gao Author-X-Name-First: Fei Author-X-Name-Last: Gao Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: David Couper Author-X-Name-First: David Author-X-Name-Last: Couper Author-Name: D. Y. Lin Author-X-Name-First: D. Y. Author-X-Name-Last: Lin Title: Semiparametric Regression Analysis of Multiple Right- and Interval-Censored Events Abstract: Health sciences research often involves both right- and interval-censored events because the occurrence of a symptomatic disease can only be observed up to the end of follow-up, while the occurrence of an asymptomatic disease can only be detected through periodic examinations.
We formulate the effects of potentially time-dependent covariates on the joint distribution of multiple right- and interval-censored events through semiparametric proportional hazards models with random effects that capture the dependence both within and between the two types of events. We consider nonparametric maximum likelihood estimation and develop a simple and stable EM algorithm for computation. We show that the resulting estimators are consistent and the parametric components are asymptotically normal and efficient with a covariance matrix that can be consistently estimated by profile likelihood or nonparametric bootstrap. In addition, we leverage the joint modeling to provide dynamic prediction of disease incidence based on the evolving event history. Furthermore, we assess the performance of the proposed methods through extensive simulation studies. Finally, we provide an application to a major epidemiological cohort study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1232-1240 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1482756 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1482756 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1232-1240 Template-Type: ReDIF-Article 1.0 Author-Name: Shirong Deng Author-X-Name-First: Shirong Author-X-Name-Last: Deng Author-Name: Xingqiu Zhao Author-X-Name-First: Xingqiu Author-X-Name-Last: Zhao Title: Covariate-Adjusted Regression for Distorted Longitudinal Data With Informative Observation Times Abstract: In many longitudinal studies, repeated response and predictors are not directly observed, but can be treated as distorted by unknown functions of a common confounding covariate. Moreover, longitudinal data involve an observation process that may be informative about the longitudinal response process in practice. To deal with such complex data, we propose a class of flexible semiparametric covariate-adjusted joint models. The new models not only allow for the longitudinal response to be correlated with observation times through latent variables and completely unspecified link functions, but they also characterize distorted longitudinal response and predictors by unknown multiplicative factors depending on time and a confounding covariate. For estimation of regression parameters in the proposed models, we develop a novel covariate-adjusted estimating equation approach which does not rely on forms of link functions and distributions of frailties. The asymptotic properties of the resulting parameter estimators are established and examined by simulation studies. A longitudinal data example containing calcium absorption and intake measurements is provided for illustration. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1241-1250 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1482757 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1482757 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1241-1250 Template-Type: ReDIF-Article 1.0 Author-Name: Jia Guo Author-X-Name-First: Jia Author-X-Name-Last: Guo Author-Name: Bu Zhou Author-X-Name-First: Bu Author-X-Name-Last: Zhou Author-Name: Jin-Ting Zhang Author-X-Name-First: Jin-Ting Author-X-Name-Last: Zhang Title: New Tests for Equality of Several Covariance Functions for Functional Data Abstract: In this article, we propose two new tests for the equality of the covariance functions of several functional populations, namely, a quasi-GPF test and a quasi-Fmax test whose test statistics are obtained via globalizing a pointwise quasi-F-test statistic with integration and taking its supremum over some time interval of interest, respectively. Unlike several existing tests, they are scale-invariant in the sense that their test statistics will not change if we multiply each of the observed functions by any nonzero function of time. We derive the asymptotic random expressions of the two tests under the null hypothesis and show that under some mild conditions, the asymptotic null distribution of the quasi-GPF test is a chi-squared-type mixture whose distribution can be well approximated by a simple scaled chi-squared distribution. We also propose a random permutation method for approximating the null distributions of the quasi-GPF and Fmax tests. The asymptotic distributions of the two tests under a local alternative are also investigated and the two tests are shown to be root-n consistent. A theoretical power comparison between the quasi-GPF test and the L2-norm-based test proposed in the literature is also given. Simulation studies are presented to demonstrate the finite-sample performance of the new tests against five existing tests. An illustrative example is also presented. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1251-1263 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1483827 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1483827 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1251-1263 Template-Type: ReDIF-Article 1.0 Author-Name: Niklas Pfister Author-X-Name-First: Niklas Author-X-Name-Last: Pfister Author-Name: Peter Bühlmann Author-X-Name-First: Peter Author-X-Name-Last: Bühlmann Author-Name: Jonas Peters Author-X-Name-First: Jonas Author-X-Name-Last: Peters Title: Invariant Causal Prediction for Sequential Data Abstract: We investigate the problem of inferring the causal predictors of a response Y from a set of d explanatory variables (X1, …, Xd). Classical ordinary least-squares regression includes all predictors that reduce the variance of Y. Using only the causal predictors instead leads to models that have the advantage of remaining invariant under interventions; loosely speaking, they lead to invariance across different “environments” or “heterogeneity patterns.” More precisely, the conditional distribution of Y given its causal predictors is the same for all observations, provided that there are no interventions on Y. Recent work exploits such a stability to infer causal relations from data with different but known environments. We show that even without having knowledge of the environments or heterogeneity pattern, inferring causal relations is possible for time-ordered (or any other type of sequentially ordered) data.
In particular, this allows detecting instantaneous causal relations in multivariate linear time series, which is usually not the case for Granger causality. Besides novel methodology, we provide statistical confidence bounds and asymptotic detection results for inferring causal predictors, and present an application to monetary policy in macroeconomics. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1264-1276 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1491403 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1491403 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1264-1276 Template-Type: ReDIF-Article 1.0 Author-Name: Wei Qian Author-X-Name-First: Wei Author-X-Name-Last: Qian Author-Name: Shanshan Ding Author-X-Name-First: Shanshan Author-X-Name-Last: Ding Author-Name: R. Dennis Cook Author-X-Name-First: R. Dennis Author-X-Name-Last: Cook Title: Sparse Minimum Discrepancy Approach to Sufficient Dimension Reduction with Simultaneous Variable Selection in Ultrahigh Dimension Abstract: Sufficient dimension reduction (SDR) is known to be a powerful tool for achieving data reduction and data visualization in regression and classification problems. In this work, we study ultrahigh-dimensional SDR problems and propose solutions under a unified minimum discrepancy approach with regularization. When p grows exponentially with n, consistency results in both central subspace estimation and variable selection are established simultaneously for important SDR methods, including sliced inverse regression (SIR), principal fitted component (PFC), and sliced average variance estimation (SAVE). Special sparse structures of large predictor or error covariance are also considered for potentially better performance. In addition, the proposed approach is equipped with a new algorithm to efficiently solve the regularized objective functions and a new data-driven procedure to determine structural dimension and tuning parameters, without the need to invert a large covariance matrix. Simulations and a real data analysis are offered to demonstrate the promise of our proposal in ultrahigh-dimensional settings. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1277-1290 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1497498 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1497498 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1277-1290 Template-Type: ReDIF-Article 1.0 Author-Name: Qingyuan Zhao Author-X-Name-First: Qingyuan Author-X-Name-Last: Zhao Author-Name: Dylan S. Small Author-X-Name-First: Dylan S. Author-X-Name-Last: Small Author-Name: Weijie Su Author-X-Name-First: Weijie Author-X-Name-Last: Su Title: Multiple Testing When Many p-Values are Uniformly Conservative, with Application to Testing Qualitative Interaction in Educational Interventions Abstract: In the evaluation of treatment effects, it is of major policy interest to know if the treatment is beneficial for some and harmful for others, a phenomenon known as qualitative interaction. We formulate this question as a multiple testing problem with many conservative null p-values, in which the classical multiple testing methods may lose power substantially. 
We propose a simple technique—conditioning—to improve the power. A crucial assumption we need is uniform conservativeness, meaning that for any conservative p-value p, the conditional distribution (p/τ) | p ⩽ τ is stochastically larger than the uniform distribution on (0, 1) for any threshold τ. We show that this property holds for one-sided tests in a one-dimensional exponential family (e.g., testing for qualitative interaction) as well as testing |μ| ⩽ η using a statistic Y ∼ N(μ, 1) (e.g., testing for practical importance with threshold η). We propose an adaptive method to select the threshold τ. Our theoretical and simulation results suggest that the proposed tests gain significant power when many p-values are uniformly conservative and lose little power when no p-value is uniformly conservative. We apply our method to two educational intervention datasets. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1291-1304 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1497499 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1497499 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1291-1304 Template-Type: ReDIF-Article 1.0 Author-Name: Yuqing Pan Author-X-Name-First: Yuqing Author-X-Name-Last: Pan Author-Name: Qing Mai Author-X-Name-First: Qing Author-X-Name-Last: Mai Author-Name: Xin Zhang Author-X-Name-First: Xin Author-X-Name-Last: Zhang Title: Covariate-Adjusted Tensor Classification in High Dimensions Abstract: In contemporary scientific research, it is often of great interest to predict a categorical response based on a high-dimensional tensor (i.e., multi-dimensional array) and additional covariates. Motivated by applications in science and engineering, we propose a comprehensive and interpretable discriminant analysis model, called the CATCH model (short for covariate-adjusted tensor classification in high dimensions). The CATCH model efficiently integrates the covariates and the tensor to predict the categorical outcome. It also jointly explains the complicated relationships among the covariates, the tensor predictor, and the categorical response. The tensor structure is used to achieve easy interpretation and accurate prediction. To tackle the new computational and statistical challenges arising from the intimidating tensor dimensions, we propose a penalized approach to select a subset of the tensor predictor entries that affect classification after adjustment for the covariates. An efficient algorithm is developed to take advantage of the tensor structure in the penalized estimation. Theoretical results confirm that the proposed method achieves variable selection and prediction consistency, even when the tensor dimension is much larger than the sample size. The superior performance of our method over existing methods is demonstrated in extensive simulated and real data examples. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1305-1319 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1497500 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1497500 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
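A minimal sketch of the conditioning device from the Zhao, Small, and Su abstract above: keep only the p-values below a threshold τ, rescale the survivors to p/τ, and feed them to Benjamini-Hochberg. The fixed τ and the stylized mixture of conservative nulls and signals are assumptions of the sketch; the paper selects τ adaptively.

```python
# Conditioning for conservative p-values: discard p > tau, rescale the rest
# to p/tau (valid under uniform conservativeness), then run BH step-up.
import numpy as np

def conditional_bh(pvals, tau=0.1, alpha=0.05):
    keep = np.where(pvals <= tau)[0]
    q = pvals[keep] / tau                        # conditional p-values on (0, 1]
    m = len(q)
    if m == 0:
        return np.array([], dtype=int)
    order = np.argsort(q)
    thresh = alpha * np.arange(1, m + 1) / m     # BH boundary
    below = np.nonzero(q[order] <= thresh)[0]
    if below.size == 0:
        return np.array([], dtype=int)
    return keep[order[: below.max() + 1]]        # indices of rejected hypotheses

rng = np.random.default_rng(1)
p = np.concatenate([rng.uniform(0.2, 1.0, 900),  # uniformly conservative nulls (stylized)
                    rng.beta(0.05, 1.0, 100)])   # signals
print("rejections:", conditional_bh(p).size)
```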
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1305-1319 Template-Type: ReDIF-Article 1.0 Author-Name: Shulei Wang Author-X-Name-First: Shulei Author-X-Name-Last: Wang Author-Name: Ming Yuan Author-X-Name-First: Ming Author-X-Name-Last: Yuan Title: Combined Hypothesis Testing on Graphs With Applications to Gene Set Enrichment Analysis Abstract: Motivated by gene set enrichment analysis, we investigate the problem of combined hypothesis testing on a graph. A general framework is introduced to make effective use of the structural information of the underlying graph when testing multivariate means. A new testing procedure is proposed within this framework, and shown to be optimal in that it can consistently detect departures from the collective null at a rate that no other test could improve, for almost all graphs. We also provide general performance bounds for the proposed test under any specific graph, and illustrate their utility through several common types of graphs. Numerical experiments are presented to further demonstrate the merits of our approach. Journal: Journal of the American Statistical Association Pages: 1320-1338 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1497501 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1497501 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1320-1338 Template-Type: ReDIF-Article 1.0 Author-Name: Frank Windmeijer Author-X-Name-First: Frank Author-X-Name-Last: Windmeijer Author-Name: Helmut Farbmacher Author-X-Name-First: Helmut Author-X-Name-Last: Farbmacher Author-Name: Neil Davies Author-X-Name-First: Neil Author-X-Name-Last: Davies Author-Name: George Davey Smith Author-X-Name-First: George Author-X-Name-Last: Davey Smith Title: On the Use of the Lasso for Instrumental Variables Estimation with Some Invalid Instruments Abstract: We investigate the behavior of the Lasso for selecting invalid instruments in linear instrumental variables models for estimating causal effects of exposures on outcomes, as proposed recently by Kang et al. Invalid instruments are such that they fail the exclusion restriction and enter the model as explanatory variables. We show that for this setup, the Lasso may not consistently select the invalid instruments if these are relatively strong. We propose a median estimator that is consistent when less than 50% of the instruments are invalid, and its consistency does not depend on the relative strength of the instruments, or their correlation structure. We show that this estimator can be used for adaptive Lasso estimation, with the resulting estimator having oracle properties. The methods are applied to a Mendelian randomization study to estimate the causal effect of body mass index (BMI) on diastolic blood pressure, using data on individuals from the UK Biobank, with 96 single nucleotide polymorphisms as potential instruments for BMI. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1339-1350 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1498346 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1498346 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
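To make the median idea in the Windmeijer, Farbmacher, Davies, and Davey Smith abstract above concrete, here is a hypothetical sketch: each instrument yields a just-identified ratio estimate, and the median of these ratios is consistent when fewer than half of the instruments are invalid. The simulated design and the omission of the paper's adaptive-Lasso refinement are assumptions of the sketch.

```python
# Per-instrument Wald ratios and their median: with independent instruments,
# ratio_j = beta + alpha_j / gamma_j, so valid instruments (alpha_j = 0)
# pin the median at beta as long as they form a majority.
import numpy as np

rng = np.random.default_rng(2)
n, L = 5000, 21
Z = rng.normal(size=(n, L))
gamma = rng.uniform(0.5, 1.0, L)                        # instrument strengths
alpha = np.r_[np.zeros(14), rng.uniform(0.5, 1.0, 7)]   # 7 invalid instruments
u = rng.normal(size=n)                                  # confounder
x = Z @ gamma + u + rng.normal(size=n)
y = 0.4 * x + Z @ alpha + u + rng.normal(size=n)        # true effect 0.4

Gamma_hat = np.array([np.polyfit(Z[:, j], y, 1)[0] for j in range(L)])  # reduced form
gamma_hat = np.array([np.polyfit(Z[:, j], x, 1)[0] for j in range(L)])  # first stage
print("median estimator:", np.median(Gamma_hat / gamma_hat), "(true 0.4)")
```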
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1339-1350 Template-Type: ReDIF-Article 1.0 Author-Name: Yuval Benjamini Author-X-Name-First: Yuval Author-X-Name-Last: Benjamini Author-Name: Jonathan Taylor Author-X-Name-First: Jonathan Author-X-Name-Last: Taylor Author-Name: Rafael A. Irizarry Author-X-Name-First: Rafael A. Author-X-Name-Last: Irizarry Title: Selection-Corrected Statistical Inference for Region Detection With High-Throughput Assays Abstract: Scientists use high-dimensional measurement assays to detect and prioritize regions of strong signal in a spatially organized domain. Examples include finding methylation-enriched genomic regions using microarrays, and active cortical areas using brain imaging. The most common procedure for detecting potential regions is to group neighboring sites where the signal passed a threshold. However, one needs to account for the selection bias induced by this procedure to avoid diminishing effects when generalizing to a population. This article introduces pin-down inference, a model and an inference framework that permit population inference for these detected regions. Pin-down inference provides nonasymptotic point and confidence interval estimators for the mean effect in the region that account for local selection bias. Our estimators accommodate nonstationary covariances that are typical of these data, allowing researchers to better compare regions of different sizes and correlation structures. Inference is provided within a conditional one-parameter exponential family per region, with truncations that match the selection constraints. A secondary screening-and-adjustment step allows pruning the set of detected regions, while controlling the false-coverage rate over the reported regions. We apply the method to genomic regions with differing DNA-methylation rates across tissues. Our method provides superior power compared to other conditional and nonparametric approaches. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1351-1365 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1498347 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1498347 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1351-1365 Template-Type: ReDIF-Article 1.0 Author-Name: Abdelaati Daouia Author-X-Name-First: Abdelaati Author-X-Name-Last: Daouia Author-Name: Irène Gijbels Author-X-Name-First: Irène Author-X-Name-Last: Gijbels Author-Name: Gilles Stupfler Author-X-Name-First: Gilles Author-X-Name-Last: Stupfler Title: Extremiles: A New Perspective on Asymmetric Least Squares Abstract: Quantiles and expectiles of a distribution are found to be useful descriptors of its tail in the same way as the median and mean are related to its central behavior. This article considers a valuable alternative class to expectiles, called extremiles, which parallels the class of quantiles and includes the family of expected minima and expected maxima. The new class is motivated from several angles, which reveal its specific merits and strengths. Extremiles offer a better capability of fitting both location and spread in data points and provide an appropriate theory that better displays the interesting features of long-tailed distributions. We discuss their estimation in the range of the data and beyond the sample maximum.
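The expected-maxima connection in the Daouia, Gijbels, and Stupfler abstract above can be sketched numerically. Assuming the distortion function takes the form K_tau(t) = t^(log(1/2)/log(tau)) for tau >= 1/2 (one common parameterization, under which tau = 2^(-1/k) corresponds to an expected maximum of k draws; this form is an assumption of the sketch), a sample extremile is a weighted average of order statistics:

```python
# Sample extremile as a weighted average of order statistics, with weights
# K(i/n) - K((i-1)/n) for an assumed power distortion K(t) = t**r.
import numpy as np

def extremile(x, tau):
    n = len(x)
    r = np.log(0.5) / np.log(tau)          # power of the assumed distortion
    grid = (np.arange(n + 1) / n) ** r
    return np.sort(x) @ np.diff(grid)      # weighted order statistics

rng = np.random.default_rng(3)
x = rng.standard_t(df=4, size=10_000)
k = 10
print("extremile at tau = 2**(-1/10):", extremile(x, 2 ** (-1 / k)))
print("Monte Carlo expected max of 10:",
      rng.standard_t(4, (20_000, k)).max(axis=1).mean())
```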
A number of motivating examples are given to illustrate the utility of estimated extremiles in modeling noncentral behavior. There is in particular an interesting connection with coherent measures of risk protection. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1366-1381 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1498348 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1498348 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1366-1381 Template-Type: ReDIF-Article 1.0 Author-Name: J. C. Escanciano Author-X-Name-First: J. C. Author-X-Name-Last: Escanciano Author-Name: S. C. Goh Author-X-Name-First: S. C. Author-X-Name-Last: Goh Title: Quantile-Regression Inference With Adaptive Control of Size Abstract: Regression quantiles have asymptotic variances that depend on the conditional densities of the response variable given regressors. This article develops a new estimate of the asymptotic variance of regression quantiles that leads any resulting Wald-type test or confidence region to behave as well in large samples as its infeasible counterpart in which the true conditional response densities are embedded. We give explicit guidance on implementing the new variance estimator to control adaptively the size of any resulting Wald-type test. Monte Carlo evidence indicates the potential of our approach to deliver powerful tests of heterogeneity of quantile treatment effects in covariates with good size performance over different quantile levels, data-generating processes, and sample sizes. We also include an empirical example. Supplementary material is available online. Journal: Journal of the American Statistical Association Pages: 1382-1393 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1505624 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1505624 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1382-1393 Template-Type: ReDIF-Article 1.0 Author-Name: James E. Johndrow Author-X-Name-First: James E. Author-X-Name-Last: Johndrow Author-Name: Aaron Smith Author-X-Name-First: Aaron Author-X-Name-Last: Smith Author-Name: Natesh Pillai Author-X-Name-First: Natesh Author-X-Name-Last: Pillai Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: MCMC for Imbalanced Categorical Data Abstract: Many modern applications collect highly imbalanced categorical data, with some categories relatively rare. Bayesian hierarchical models combat data sparsity by borrowing information, while also quantifying uncertainty. However, posterior computation presents a fundamental barrier to routine use; a single class of algorithms does not work well in all settings and practitioners waste time trying different types of Markov chain Monte Carlo (MCMC) approaches. This article was motivated by an application to quantitative advertising in which we encountered extremely poor computational performance for data augmentation MCMC algorithms but obtained excellent performance for adaptive Metropolis. To obtain a deeper understanding of this behavior, we derive theoretical results on the computational complexity of commonly used data augmentation algorithms and the Random Walk Metropolis algorithm for highly imbalanced binary data. 
In this regime, our results show that the computational complexity of Metropolis is logarithmic in the sample size, while that of data augmentation is polynomial in the sample size. The root cause of this poor performance of data augmentation is a discrepancy between the rates at which the target density and MCMC step sizes concentrate. Our methods also show that MCMC algorithms that exhibit a similar discrepancy will fail in large samples—a result with substantial practical impact. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1394-1403 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1505626 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1505626 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1394-1403 Template-Type: ReDIF-Article 1.0 Author-Name: Wensheng Zhu Author-X-Name-First: Wensheng Author-X-Name-Last: Zhu Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: Rui Song Author-X-Name-First: Rui Author-X-Name-Last: Song Title: Proper Inference for Value Function in High-Dimensional Q-Learning for Dynamic Treatment Regimes Abstract: Dynamic treatment regimes are sets of decision rules under which each treatment decision is tailored over time according to patients’ responses to previous treatments as well as covariate history. There is growing interest in the development of correct statistical inference for optimal dynamic treatment regimes to handle the challenges of nonregularity problems in the presence of nonrespondents who have zero-treatment effects, especially when the dimension of the tailoring variables is high. In this article, we propose high-dimensional Q-learning (HQ-learning) to facilitate the inference of optimal values and parameters. The proposed method allows us to simultaneously estimate the optimal dynamic treatment regimes and select the important variables that truly contribute to the individual reward. At the same time, hard thresholding is introduced in the method to eliminate the effects of the nonrespondents. The asymptotic properties for the parameter estimators as well as the estimated optimal value function are then established by adjusting the bias due to thresholding. Both simulation studies and a real data analysis demonstrate satisfactory performance in obtaining proper inference for the value function of the optimal dynamic treatment regimes. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1404-1417 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2018.1506341 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1506341 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1404-1417 Template-Type: ReDIF-Article 1.0 Author-Name: Hongying Dai Author-X-Name-First: Hongying Author-X-Name-Last: Dai Title: Asymptotic Analysis of Mixed Effects Models: Theory, Applications, and Open Problems Journal: Journal of the American Statistical Association Pages: 1418-1420 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2019.1662242 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1662242 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
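The hard-thresholding step in the Zhu, Zeng, and Song abstract above can be caricatured in a single-stage setting: estimate each subject's treatment contrast by regression, and zero out contrasts below a threshold so that nonrespondents do not inject noise into the decision rule. The threshold level and the low-dimensional design are assumptions of this sketch; HQ-learning's high-dimensional selection and bias-corrected inference are omitted.

```python
# Single-stage Q-learning caricature with hard thresholding: subjects whose
# estimated treatment contrast is below a threshold are treated as
# nonrespondents (zero effect) when forming the rule and its value estimate.
import numpy as np

rng = np.random.default_rng(4)
n = 2000
X = rng.normal(size=(n, 2))
A = rng.integers(0, 2, n)                          # randomized treatment
effect = np.where(X[:, 0] > 0, X[:, 0], 0.0)       # nonrespondents have X1 <= 0
Y = X @ np.array([1.0, -0.5]) + A * effect + rng.normal(size=n)

# Stage regression: Y ~ 1 + X + A * (1, X)
D = np.column_stack([np.ones(n), X, A[:, None] * np.column_stack([np.ones(n), X])])
coef = np.linalg.lstsq(D, Y, rcond=None)[0]
contrast = np.column_stack([np.ones(n), X]) @ coef[3:]   # estimated A-effect

lam = 2 * np.sqrt(np.log(n) / n)                   # hypothetical threshold level
rule = (contrast > 0) & (np.abs(contrast) > lam)   # treat only clear responders
value = np.mean(D[:, :3] @ coef[:3] + rule * contrast)
print("treated fraction:", rule.mean(), " plug-in value:", round(value, 3))
```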
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1418-1420 Template-Type: ReDIF-Article 1.0 Author-Name: Kaixian Yu Author-X-Name-First: Kaixian Author-X-Name-Last: Yu Title: Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics. Journal: Journal of the American Statistical Association Pages: 1420-1421 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2019.1662241 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1662241 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1420-1421 Template-Type: ReDIF-Article 1.0 Author-Name: Shu Yang Author-X-Name-First: Shu Author-X-Name-Last: Yang Title: Flexible Imputation of Missing Data, 2nd ed. Journal: Journal of the American Statistical Association Pages: 1421-1421 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2019.1662249 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1662249 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1421-1421 Template-Type: ReDIF-Article 1.0 Author-Name: Ofer Harel Author-X-Name-First: Ofer Author-X-Name-Last: Harel Title: Missing and Modified Data in Nonparametric Estimation: With R Examples. Journal: Journal of the American Statistical Association Pages: 1421-1423 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2019.1662248 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1662248 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1421-1423 Template-Type: ReDIF-Article 1.0 Author-Name: Anna Snavely Author-X-Name-First: Anna Author-X-Name-Last: Snavely Title: Randomization, Masking, and Allocation Concealment. Journal: Journal of the American Statistical Association Pages: 1423-1424 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2019.1662247 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1662247 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1423-1424 Template-Type: ReDIF-Article 1.0 Author-Name: Chen Zhou Author-X-Name-First: Chen Author-X-Name-Last: Zhou Title: Risk Theory: A Heavy Tail Approach. Journal: Journal of the American Statistical Association Pages: 1424-1425 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2019.1662244 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1662244 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1424-1425 Template-Type: ReDIF-Article 1.0 Author-Name: Dootika Vats Author-X-Name-First: Dootika Author-X-Name-Last: Vats Title: Simulation and the Monte Carlo Method, 3rd ed. Journal: Journal of the American Statistical Association Pages: 1425-1425 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2019.1662243 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1662243 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1425-1425 Template-Type: ReDIF-Article 1.0 Author-Name: Jae-Kwang Kim Author-X-Name-First: Jae-Kwang Author-X-Name-Last: Kim Title: Statistical Data Fusion Journal: Journal of the American Statistical Association Pages: 1425-1426 Issue: 527 Volume: 114 Year: 2019 Month: 7 X-DOI: 10.1080/01621459.2019.1662245 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1662245 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1425-1426 Template-Type: ReDIF-Article 1.0 Author-Name: Ganggang Xu Author-X-Name-First: Ganggang Author-X-Name-Last: Xu Author-Name: Rasmus Waagepetersen Author-X-Name-First: Rasmus Author-X-Name-Last: Waagepetersen Author-Name: Yongtao Guan Author-X-Name-First: Yongtao Author-X-Name-Last: Guan Title: Stochastic Quasi-Likelihood for Case-Control Point Pattern Data Abstract: We propose a novel stochastic quasi-likelihood estimation procedure for case-control point processes. Quasi-likelihood for point processes depends on a certain optimal weight function and for the new method the weight function is stochastic since it depends on the control point pattern. The new procedure also provides a computationally efficient implementation of quasi-likelihood for univariate point processes in which case a synthetic control point process is simulated by the user. Under mild conditions, the proposed approach yields consistent and asymptotically normal parameter estimators. We further show that the estimators are optimal in the sense that the associated Godambe information is maximal within a wide class of estimating functions for case-control point processes. The effectiveness of the proposed method is further illustrated using extensive simulation studies and two data examples. Journal: Journal of the American Statistical Association Pages: 631-644 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2017.1421543 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1421543 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:631-644 Template-Type: ReDIF-Article 1.0 Author-Name: Edward H. Kennedy Author-X-Name-First: Edward H. Author-X-Name-Last: Kennedy Title: Nonparametric Causal Effects Based on Incremental Propensity Score Interventions Abstract: Most work in causal inference considers deterministic interventions that set each unit’s treatment to some fixed value. However, under positivity violations these interventions can lead to nonidentification, inefficiency, and effects with little practical relevance. Further, corresponding effects in longitudinal studies are highly sensitive to the curse of dimensionality, resulting in widespread use of unrealistic parametric models. We propose a novel solution to these problems: incremental interventions that shift propensity score values rather than set treatments to fixed values. Incremental interventions have several crucial advantages. First, they avoid positivity assumptions entirely. Second, they require no parametric assumptions and yet still admit a simple characterization of longitudinal effects, independent of the number of timepoints. For example, they allow longitudinal effects to be visualized with a single curve instead of lists of coefficients. 
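For the Kennedy abstract above, a point-treatment sketch shows the core object: multiplying the treatment odds by delta gives the shifted propensity q(x; delta) = delta*pi(x) / (delta*pi(x) + 1 - pi(x)), and a g-computation plug-in traces the mean outcome as delta varies. The simulated data and the simple plug-in (rather than the paper's efficient longitudinal estimators) are assumptions of the sketch.

```python
# Incremental propensity-score intervention, point-treatment version:
# odds of treatment are multiplied by delta; delta = 1 recovers the
# observational mean outcome.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(5)
n = 5000
X = rng.normal(size=(n, 2))
pi = 1 / (1 + np.exp(-(0.5 * X[:, 0] - 0.5 * X[:, 1])))
A = rng.binomial(1, pi)
Y = X[:, 0] + A * (1 + 0.5 * X[:, 1]) + rng.normal(size=n)

pi_hat = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]
m1 = LinearRegression().fit(X[A == 1], Y[A == 1]).predict(X)  # E[Y | X, A=1]
m0 = LinearRegression().fit(X[A == 0], Y[A == 0]).predict(X)  # E[Y | X, A=0]

for delta in [0.25, 1.0, 4.0]:
    q = delta * pi_hat / (delta * pi_hat + 1 - pi_hat)        # shifted propensity
    print(f"delta = {delta}: mean outcome = {np.mean(q * m1 + (1 - q) * m0):.3f}")
```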
After characterizing incremental interventions and giving identifying conditions for corresponding effects, we also develop general efficiency theory, propose efficient nonparametric estimators that can attain fast convergence rates even when incorporating flexible machine learning, and propose a bootstrap-based confidence band and simultaneous test of no treatment effect. Finally, we explore finite-sample performance via simulation, and apply the methods to study time-varying sociological effects of incarceration on entry into marriage. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 645-656 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2017.1422737 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1422737 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:645-656 Template-Type: ReDIF-Article 1.0 Author-Name: Leqin Wu Author-X-Name-First: Leqin Author-X-Name-Last: Wu Author-Name: Xing Qiu Author-X-Name-First: Xing Author-X-Name-Last: Qiu Author-Name: Ya-xiang Yuan Author-X-Name-First: Ya-xiang Author-X-Name-Last: Yuan Author-Name: Hulin Wu Author-X-Name-First: Hulin Author-X-Name-Last: Wu Title: Parameter Estimation and Variable Selection for Big Systems of Linear Ordinary Differential Equations: A Matrix-Based Approach Abstract: Ordinary differential equations (ODEs) are widely used to model the dynamic behavior of a complex system. Parameter estimation and variable selection for a “Big System” with linear ODEs are very challenging due to the need of nonlinear optimization in an ultra-high dimensional parameter space. In this article, we develop a parameter estimation and variable selection method based on the ideas of similarity transformation and separable least squares (SLS). Simulation studies demonstrate that the proposed matrix-based SLS method could be used to estimate the coefficient matrix more accurately and perform variable selection for a linear ODE system with thousands of dimensions and millions of parameters much better than the direct least squares method and the vector-based two-stage method that are currently available. We applied this new method to two real datasets—a yeast cell cycle gene expression dataset with 30 dimensions and 930 unknown parameters and the Standard & Poor 1500 index stock price data with 1250 dimensions and 1,563,750 unknown parameters—to illustrate the utility and numerical performance of the proposed parameter estimation and variable selection method for big systems in practice. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 657-667 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2017.1423074 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1423074 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:657-667 Template-Type: ReDIF-Article 1.0 Author-Name: Michael I. Jordan Author-X-Name-First: Michael I. Author-X-Name-Last: Jordan Author-Name: Jason D. Lee Author-X-Name-First: Jason D. 
Author-X-Name-Last: Lee Author-Name: Yun Yang Author-X-Name-First: Yun Author-X-Name-Last: Yang Title: Communication-Efficient Distributed Statistical Inference Abstract: We present a communication-efficient surrogate likelihood (CSL) framework for solving distributed statistical inference problems. CSL provides a communication-efficient surrogate to the global likelihood that can be used for low-dimensional estimation, high-dimensional regularized estimation, and Bayesian inference. For low-dimensional estimation, CSL provably improves upon naive averaging schemes and facilitates the construction of confidence intervals. For high-dimensional regularized estimation, CSL leads to a minimax-optimal estimator with controlled communication cost. For Bayesian inference, CSL can be used to form a communication-efficient quasi-posterior distribution that converges to the true posterior. This quasi-posterior procedure significantly improves the computational efficiency of Markov chain Monte Carlo (MCMC) algorithms even in a nondistributed setting. We present both theoretical analysis and experiments to explore the properties of the CSL approximation. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 668-681 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1429274 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1429274 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:668-681 Template-Type: ReDIF-Article 1.0 Author-Name: Michael Hornstein Author-X-Name-First: Michael Author-X-Name-Last: Hornstein Author-Name: Roger Fan Author-X-Name-First: Roger Author-X-Name-Last: Fan Author-Name: Kerby Shedden Author-X-Name-First: Kerby Author-X-Name-Last: Shedden Author-Name: Shuheng Zhou Author-X-Name-First: Shuheng Author-X-Name-Last: Zhou Title: Joint Mean and Covariance Estimation with Unreplicated Matrix-Variate Data Abstract: It has been proposed that complex populations, such as those that arise in genomics studies, may exhibit dependencies among observations as well as among variables. This gives rise to the challenging problem of analyzing unreplicated high-dimensional data with unknown mean and dependence structures. Matrix-variate approaches that impose various forms of (inverse) covariance sparsity allow flexible dependence structures to be estimated, but cannot directly be applied when the mean and covariance matrices are estimated jointly. We present a practical method utilizing generalized least squares and penalized (inverse) covariance estimation to address this challenge. We establish consistency and obtain rates of convergence for estimating the mean parameters and covariance matrices. The advantages of our approaches are: (i) dependence graphs and covariance structures can be estimated in the presence of unknown mean structure, (ii) the mean structure becomes more efficiently estimated when accounting for the dependence structure among observations; and (iii) inferences about the mean parameters become correctly calibrated. We use simulation studies and analysis of genomic data from a twin study of ulcerative colitis to illustrate the statistical convergence and the performance of our methods in practical settings. Several lines of evidence show that the test statistics for differential gene expression produced by our methods are correctly calibrated and improve power over conventional methods. 
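A one-round sketch of the surrogate-likelihood idea in the Jordan, Lee, and Yang abstract above, specialized to least squares: the first machine keeps its local loss and corrects it with the gap between global and local gradients at a pilot estimate, so only gradients (not raw data) cross machines. The quadratic loss and the single refinement round are simplifying assumptions of the sketch.

```python
# One-round communication-efficient surrogate for distributed least squares:
# minimize L1(theta) - <grad L1(theta0) - grad Lglobal(theta0), theta>, which
# for a quadratic loss reduces to a single linear solve on machine 1.
import numpy as np

rng = np.random.default_rng(6)
k, m, p = 10, 500, 5                         # machines, per-machine n, dimension
theta_true = rng.normal(size=p)
Xs = [rng.normal(size=(m, p)) for _ in range(k)]
ys = [X @ theta_true + rng.normal(size=m) for X in Xs]

def grad(X, y, t):                           # gradient of 0.5*||y - X t||^2 / m
    return X.T @ (X @ t - y) / len(y)

theta0 = np.linalg.lstsq(Xs[0], ys[0], rcond=None)[0]       # local pilot estimate
g_global = np.mean([grad(X, y, theta0) for X, y in zip(Xs, ys)], axis=0)
g_local = grad(Xs[0], ys[0], theta0)

# Surrogate minimizer solves (X1'X1/m) theta = X1'y1/m + (g_local - g_global).
theta_csl = np.linalg.solve(Xs[0].T @ Xs[0] / m,
                            Xs[0].T @ ys[0] / m + (g_local - g_global))
print("local-only error :", np.linalg.norm(theta0 - theta_true))
print("one-round error  :", np.linalg.norm(theta_csl - theta_true))
```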
Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 682-696 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1429275 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1429275 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:682-696 Template-Type: ReDIF-Article 1.0 Author-Name: Ryan J. Tibshirani Author-X-Name-First: Ryan J. Author-X-Name-Last: Tibshirani Author-Name: Saharon Rosset Author-X-Name-First: Saharon Author-X-Name-Last: Rosset Title: Excess Optimism: How Biased is the Apparent Error of an Estimator Tuned by SURE? Abstract: Nearly all estimators in statistical prediction come with an associated tuning parameter, in one way or another. Common practice, given data, is to choose the tuning parameter value that minimizes a constructed estimate of the prediction error of the estimator; we focus on Stein’s unbiased risk estimator, or SURE, which forms an unbiased estimate of the prediction error by augmenting the observed training error with an estimate of the degrees of freedom of the estimator. Parameter tuning via SURE minimization has been advocated by many authors, in a wide variety of problem settings, and in general, it is natural to ask: what is the prediction error of the SURE-tuned estimator? An obvious strategy would be to simply use the apparent error estimate as reported by SURE, that is, the value of the SURE criterion at its minimum, to estimate the prediction error of the SURE-tuned estimator. But this is no longer unbiased; in fact, we would expect that the minimum of the SURE criterion is systematically biased downwards for the true prediction error. In this work, we define the excess optimism of the SURE-tuned estimator to be the amount of this downward bias in the SURE minimum. We argue that the following two properties motivate the study of excess optimism: (i) an unbiased estimate of excess optimism, added to the SURE criterion at its minimum, gives an unbiased estimate of the prediction error of the SURE-tuned estimator; (ii) excess optimism serves as an upper bound on the excess risk, that is, the difference between the risk of the SURE-tuned estimator and the oracle risk (where the oracle uses the best fixed tuning parameter choice). We study excess optimism in two common settings: shrinkage estimators and subset regression estimators. Our main results include a James–Stein-like property of the SURE-tuned shrinkage estimator, which is shown to dominate the MLE; and both upper and lower bounds on excess optimism for SURE-tuned subset regression. In the latter setting, when the collection of subsets is nested, our bounds are particularly tight, and reveal that in the case of no signal, the excess optimism is always between 0 and 10 degrees of freedom, regardless of how many models are being selected from. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 697-712 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1429276 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1429276 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
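The downward bias quantified in the Tibshirani and Rosset abstract above is easy to see numerically. The following sketch tunes a soft-thresholding estimator of a Gaussian mean by minimizing SURE and compares the minimized criterion with the estimator's actual risk; the mean vector, grid, and replication count are arbitrary choices for the demonstration.

```python
# SURE tuning of soft-thresholding, sigma = 1: the minimized SURE value is,
# on average, below the true risk of the SURE-tuned estimator.
import numpy as np

rng = np.random.default_rng(7)
n, sigma = 200, 1.0
mu = np.r_[3 * np.ones(20), np.zeros(n - 20)]
lams = np.linspace(0, 4, 81)

def soft(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def sure(y, lam):                           # unbiased estimate of ||f_hat - mu||^2
    df = np.sum(np.abs(y) > lam)            # degrees of freedom of soft-thresholding
    return -n * sigma**2 + np.sum((y - soft(y, lam))**2) + 2 * sigma**2 * df

min_sure, true_err = [], []
for _ in range(500):
    y = mu + rng.normal(size=n)
    vals = np.array([sure(y, l) for l in lams])
    lam_hat = lams[int(np.argmin(vals))]
    min_sure.append(vals.min())
    true_err.append(np.sum((soft(y, lam_hat) - mu) ** 2))
print("mean minimized SURE:", np.mean(min_sure))   # optimistic
print("mean actual risk   :", np.mean(true_err))   # larger: excess optimism
```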
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:697-712 Template-Type: ReDIF-Article 1.0 Author-Name: Qingyuan Zhao Author-X-Name-First: Qingyuan Author-X-Name-Last: Zhao Title: On Sensitivity Value of Pair-Matched Observational Studies Abstract: This article proposes a new quantity called the “sensitivity value,” which is defined as the minimum strength of unmeasured confounders needed to change the qualitative conclusions of a naive analysis assuming no unmeasured confounder. We establish the asymptotic normality of the sensitivity value in pair-matched observational studies. The theoretical results are then used to approximate the power of a sensitivity analysis and select the design of a study. We explore the potential to use sensitivity values to screen multiple hypotheses in the presence of unmeasured confounding using a microarray dataset. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 713-722 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1429277 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1429277 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:713-722 Template-Type: ReDIF-Article 1.0 Author-Name: Benjamin Frot Author-X-Name-First: Benjamin Author-X-Name-Last: Frot Author-Name: Luke Jostins Author-X-Name-First: Luke Author-X-Name-Last: Jostins Author-Name: Gilean McVean Author-X-Name-First: Gilean Author-X-Name-Last: McVean Title: Graphical Model Selection for Gaussian Conditional Random Fields in the Presence of Latent Variables Abstract: We consider the problem of learning a conditional Gaussian graphical model in the presence of latent variables. Building on recent advances in this field, we suggest a method that decomposes the parameters of a conditional Markov random field into the sum of a sparse and a low-rank matrix. We derive convergence bounds for this estimator and show that it is well-behaved in the high-dimensional regime as well as “sparsistent” (i.e., capable of recovering the graph structure). We then show how proximal gradient algorithms and semi-definite programming techniques can be employed to fit the model to thousands of variables. Through extensive simulations, we illustrate the conditions required for identifiability and show that there is a wide range of situations in which this model performs significantly better than its counterparts, for example, by accommodating more latent variables. Finally, the suggested method is applied to two datasets comprising individual-level data on genetic variants and metabolite levels. We show that our results replicate better than those of alternative approaches and show enriched biological signal. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 723-734 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1434531 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1434531 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
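For the Zhao abstract above, the following sketch computes one sensitivity value for a pair-matched sign test, using the standard Rosenbaum-type bound in which hidden bias of at most Gamma caps each pair's chance of favoring treatment at Gamma/(1 + Gamma); the simulated pair differences and the bisection tolerance are assumptions of the sketch.

```python
# Sensitivity value for a paired sign test: smallest Gamma at which the
# worst-case (binomial-tail) p-value bound crosses alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n = 300
diffs = rng.normal(loc=0.3, size=n)          # treated-minus-control differences
T = int(np.sum(diffs > 0))                   # sign-test statistic

def worst_case_p(gamma):                     # increasing in gamma
    return stats.binom.sf(T - 1, n, gamma / (1 + gamma))

lo, hi = 1.0, 50.0
for _ in range(60):                          # bisection for the crossing point
    mid = 0.5 * (lo + hi)
    lo, hi = (lo, mid) if worst_case_p(mid) > 0.05 else (mid, hi)
print("sensitivity value Gamma*: about", round(hi, 2))
```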
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:723-734 Template-Type: ReDIF-Article 1.0 Author-Name: Satyajit Ghosh Author-X-Name-First: Satyajit Author-X-Name-Last: Ghosh Author-Name: Kshitij Khare Author-X-Name-First: Kshitij Author-X-Name-Last: Khare Author-Name: George Michailidis Author-X-Name-First: George Author-X-Name-Last: Michailidis Title: High-Dimensional Posterior Consistency in Bayesian Vector Autoregressive Models Abstract: Vector autoregressive (VAR) models aim to capture linear temporal interdependencies among multiple time series. They have been widely used in macroeconomics and financial econometrics and more recently have found novel applications in functional genomics and neuroscience. These applications have also accentuated the need to investigate the behavior of the VAR model in a high-dimensional regime, which provides novel insights into the role of temporal dependence for regularized estimates of the model’s parameters. However, hardly anything is known regarding the properties of the posterior distribution for Bayesian VAR models in such regimes. In this work, we consider a VAR model with two prior choices for the autoregressive coefficient matrix: a nonhierarchical matrix-normal prior and a hierarchical prior, which corresponds to an arbitrary scale mixture of normals. We establish posterior consistency for both these priors under standard regularity assumptions, when the dimension p of the VAR model grows with the sample size n (but still remains smaller than n). A special case corresponds to a shrinkage prior that introduces (group) sparsity in the columns of the model coefficient matrices. The performance of the model estimates is illustrated on synthetic and real macroeconomic datasets. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 735-748 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1437043 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1437043 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:735-748 Template-Type: ReDIF-Article 1.0 Author-Name: Alexandre Belloni Author-X-Name-First: Alexandre Author-X-Name-Last: Belloni Author-Name: Victor Chernozhukov Author-X-Name-First: Victor Author-X-Name-Last: Chernozhukov Author-Name: Kengo Kato Author-X-Name-First: Kengo Author-X-Name-Last: Kato Title: Valid Post-Selection Inference in High-Dimensional Approximately Sparse Quantile Regression Models Abstract: This work proposes new inference methods for a regression coefficient of interest in a (heterogeneous) quantile regression model. We consider a high-dimensional model where the number of regressors potentially exceeds the sample size but a subset of them suffices to construct a reasonable approximation to the conditional quantile function. The proposed methods are (explicitly or implicitly) based on orthogonal score functions that protect against moderate model selection mistakes, which are often inevitable in the approximately sparse model considered in the present article. We establish the uniform validity of the proposed confidence regions for the quantile regression coefficient. Importantly, these methods directly apply to more than one variable and a continuum of quantile indices. In addition, the performance of the proposed methods is illustrated through Monte Carlo experiments and an empirical example, dealing with risk factors in childhood malnutrition.
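The nonhierarchical prior in the Ghosh, Khare, and Michailidis abstract above admits a closed-form posterior mean, which the following sketch computes for a VAR(1) with an iid normal (ridge-type) prior on the coefficients and the noise variance treated as known; both simplifications are assumptions of the sketch.

```python
# Posterior mean for a VAR(1) coefficient matrix under an iid N(0, tau^2)
# prior on each entry and known noise variance sigma^2: a ridge-type solve.
import numpy as np

rng = np.random.default_rng(9)
p, n = 8, 400
A_true = 0.5 * np.eye(p) + 0.1 * (rng.random((p, p)) < 0.1)
Y = np.zeros((n, p))
for t in range(1, n):
    Y[t] = Y[t - 1] @ A_true.T + rng.normal(scale=0.5, size=p)

X, Z = Y[:-1], Y[1:]                        # lagged design, response
sigma2, tau2 = 0.25, 1.0                    # variances taken as known (assumption)
A_post = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(p), X.T @ Z).T
print("relative error:", np.linalg.norm(A_post - A_true) / np.linalg.norm(A_true))
```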
Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 749-758 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1442339 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1442339 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:749-758 Template-Type: ReDIF-Article 1.0 Author-Name: Yuanpei Cao Author-X-Name-First: Yuanpei Author-X-Name-Last: Cao Author-Name: Wei Lin Author-X-Name-First: Wei Author-X-Name-Last: Lin Author-Name: Hongzhe Li Author-X-Name-First: Hongzhe Author-X-Name-Last: Li Title: Large Covariance Estimation for Compositional Data Via Composition-Adjusted Thresholding Abstract: High-dimensional compositional data arise naturally in many applications such as metagenomic data analysis. The observed data lie in a high-dimensional simplex, and conventional statistical methods often fail to produce sensible results due to the unit-sum constraint. In this article, we address the problem of covariance estimation for high-dimensional compositional data and introduce a composition-adjusted thresholding (COAT) method under the assumption that the basis covariance matrix is sparse. Our method is based on a decomposition relating the compositional covariance to the basis covariance, which is approximately identifiable as the dimensionality tends to infinity. The resulting procedure can be viewed as thresholding the sample centered log-ratio covariance matrix and hence is scalable for large covariance matrices. We rigorously characterize the identifiability of the covariance parameters, derive rates of convergence under the spectral norm, and provide theoretical guarantees on support recovery. Simulation studies demonstrate that the COAT estimator outperforms some existing optimization-based estimators. We apply the proposed method to the analysis of a microbiome dataset to understand the dependence structure among bacterial taxa in the human gut. Journal: Journal of the American Statistical Association Pages: 759-772 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1442340 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1442340 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:759-772 Template-Type: ReDIF-Article 1.0 Author-Name: Zhenguo Gao Author-X-Name-First: Zhenguo Author-X-Name-Last: Gao Author-Name: Zuofeng Shang Author-X-Name-First: Zuofeng Author-X-Name-Last: Shang Author-Name: Pang Du Author-X-Name-First: Pang Author-X-Name-Last: Du Author-Name: John L. Robertson Author-X-Name-First: John L. Author-X-Name-Last: Robertson Title: Variance Change Point Detection Under a Smoothly-Changing Mean Trend with Application to Liver Procurement Abstract: The literature on change point analysis mostly requires a sudden change in the data distribution, either in a few parameters or the distribution as a whole. We are interested in the scenario where the variance of the data may make a significant jump while the mean changes in a smooth fashion. The motivation is a liver procurement experiment monitoring organ surface temperature. Blindly applying the existing methods to the example can yield erroneous change point estimates since the smoothly changing mean violates the sudden-change assumption.
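The COAT procedure described in the Cao, Lin, and Li abstract above reduces, in its simplest form, to thresholding the centered log-ratio covariance; in the sketch below a fixed threshold at the usual sqrt(log p / n) rate stands in for the paper's data-driven tuning, and the simulated basis is an assumption.

```python
# Composition-adjusted thresholding sketch: clr-transform the compositions,
# take the sample covariance, soft-threshold the off-diagonal entries.
import numpy as np

rng = np.random.default_rng(10)
n, p = 200, 50
logw = rng.multivariate_normal(np.zeros(p), 0.5 * np.eye(p), size=n)
logw[:, 0] += 0.6 * logw[:, 1]                   # one correlated basis pair
w = np.exp(logw)
comp = w / w.sum(axis=1, keepdims=True)          # observed compositions

clr = np.log(comp) - np.log(comp).mean(axis=1, keepdims=True)  # centered log-ratio
S = np.cov(clr, rowvar=False)

lam = 2 * np.sqrt(np.log(p) / n)                 # fixed rate-based threshold
coat = np.sign(S) * np.maximum(np.abs(S) - lam, 0.0)
np.fill_diagonal(coat, np.diag(S))               # keep the diagonal unthresholded
print("nonzero off-diagonal entries:", np.count_nonzero(coat) - p)
```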
We propose a penalized weighted least-squares approach with an iterative estimation procedure that integrates variance change point detection and smooth mean function estimation. The procedure starts with a consistent initial mean estimate ignoring the variance heterogeneity. Given the variance components, the mean function is estimated by smoothing splines as the minimizer of the penalized weighted least squares. Given the mean function, we propose a likelihood ratio test statistic for identifying the variance change point. The null distribution of the test statistic is derived together with the rates of convergence of all the parameter estimates. Simulations show excellent performance of the proposed method. The application analysis offers numerical support for noninvasive organ viability assessment via surface temperature monitoring. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 773-781 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1442341 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1442341 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:773-781 Template-Type: ReDIF-Article 1.0 Author-Name: Jacob Bien Author-X-Name-First: Jacob Author-X-Name-Last: Bien Title: Graph-Guided Banding of the Covariance Matrix Abstract: Regularization has become a primary tool for developing reliable estimators of the covariance matrix in high-dimensional settings. To curb the curse of dimensionality, numerous methods assume that the population covariance (or inverse covariance) matrix is sparse, while making no particular structural assumptions on the desired pattern of sparsity. A highly related, yet complementary, literature studies the specific setting in which the measured variables have a known ordering, in which case a banded population matrix is often assumed. While the banded approach is conceptually and computationally easier than asking for “patternless sparsity,” it is only applicable in very specific situations (such as when data are measured over time or one-dimensional space). This work proposes a generalization of the notion of bandedness that greatly expands the range of problems in which banded estimators apply. We develop convex regularizers occupying the broad middle ground between the former approach of “patternless sparsity” and the latter reliance on having a known ordering. Our framework defines bandedness with respect to a known graph on the measured variables. Such a graph is available in diverse situations, and we provide a theoretical, computational, and applied treatment of two new estimators. An R package, called ggb, implements these new methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 782-792 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1442720 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1442720 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
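A bare-bones version of the two-step scheme in the Gao, Shang, Du, and Robertson abstract above: detrend with a simple smoother, then scan the residuals with a Gaussian likelihood-ratio statistic for a single variance change. The moving-average smoother and the single pass (no iteration, no smoothing spline) are assumptions of the sketch.

```python
# Variance change-point scan under a smooth mean: detrend, then maximize a
# Gaussian likelihood-ratio statistic over candidate split points.
import numpy as np

rng = np.random.default_rng(11)
n, k0 = 600, 400
t = np.linspace(0, 1, n)
sd = np.where(np.arange(n) < k0, 0.3, 0.8)            # variance jump at k0
y = np.sin(2 * np.pi * t) + sd * rng.normal(size=n)   # smooth trend + noise

h = 25                                                # moving-average half-width
mu_hat = np.convolve(y, np.ones(2 * h + 1) / (2 * h + 1), mode="same")
r = y - mu_hat                                        # detrended residuals

def neg2ll(x):                                        # mean-zero Gaussian deviance
    return len(x) * np.log(np.mean(x ** 2) + 1e-12)

lr = [neg2ll(r) - neg2ll(r[:k]) - neg2ll(r[k:]) for k in range(30, n - 30)]
print("estimated change point:", 30 + int(np.argmax(lr)), "(true:", k0, ")")
```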
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:782-792 Template-Type: ReDIF-Article 1.0 Author-Name: Prosper Dovonon Author-X-Name-First: Prosper Author-X-Name-Last: Dovonon Author-Name: Sílvia Gonçalves Author-X-Name-First: Sílvia Author-X-Name-Last: Gonçalves Author-Name: Ulrich Hounyo Author-X-Name-First: Ulrich Author-X-Name-Last: Hounyo Author-Name: Nour Meddahi Author-X-Name-First: Nour Author-X-Name-Last: Meddahi Title: Bootstrapping High-Frequency Jump Tests Abstract: The main contribution of this article is to propose a bootstrap test for jumps based on functions of realized volatility and bipower variation. Bootstrap intraday returns are randomly generated from a mean zero Gaussian distribution with a variance given by a local measure of integrated volatility (which we denote by $\{\hat{v}_i^n\}$). We first discuss a set of high-level conditions on $\{\hat{v}_i^n\}$ such that any bootstrap test of this form has the correct asymptotic size and is alternative-consistent. We then provide a set of primitive conditions that justify the choice of a thresholding-based estimator for $\{\hat{v}_i^n\}$. Our cumulant expansions show that the bootstrap is unable to mimic the higher-order bias of the test statistic. We propose a modification of the original bootstrap test which contains an appropriate bias correction term and for which second-order asymptotic refinements are obtained. Journal: Journal of the American Statistical Association Pages: 793-803 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1447485 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1447485 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:793-803 Template-Type: ReDIF-Article 1.0 Author-Name: Shanika L. Wickramasuriya Author-X-Name-First: Shanika L. Author-X-Name-Last: Wickramasuriya Author-Name: George Athanasopoulos Author-X-Name-First: George Author-X-Name-Last: Athanasopoulos Author-Name: Rob J. Hyndman Author-X-Name-First: Rob J. Author-X-Name-Last: Hyndman Title: Optimal Forecast Reconciliation for Hierarchical and Grouped Time Series Through Trace Minimization Abstract: Large collections of time series often have aggregation constraints due to product or geographical groupings. The forecasts for the most disaggregated series are usually required to add up exactly to the forecasts of the aggregated series, a constraint we refer to as “coherence.” Forecast reconciliation is the process of adjusting forecasts to make them coherent. The reconciliation algorithm proposed by Hyndman et al. (2011) is based on a generalized least squares estimator that requires an estimate of the covariance matrix of the coherency errors (i.e., the errors that arise due to incoherence). We show that this matrix is impossible to estimate in practice due to identifiability conditions. We propose a new forecast reconciliation approach that incorporates the information from a full covariance matrix of forecast errors in obtaining a set of coherent forecasts. Our approach minimizes the mean squared error of the coherent forecasts across the entire collection of time series under the assumption of unbiasedness. The minimization problem has a closed-form solution.
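The closed-form solution mentioned just above in the Wickramasuriya, Athanasopoulos, and Hyndman abstract is the projection y_tilde = S (S' W^{-1} S)^{-1} S' W^{-1} y_hat, where S is the summing matrix and W the covariance of the base-forecast errors. The tiny two-series hierarchy and the known W below are assumptions made for illustration; estimating W is the focus of the paper.

```python
# Trace-minimizing reconciliation for a hierarchy Total = A + B: map the
# incoherent base forecasts onto the coherent subspace spanned by S.
import numpy as np

S = np.array([[1.0, 1.0],       # Total = A + B
              [1.0, 0.0],       # A
              [0.0, 1.0]])      # B
W = np.diag([2.0, 1.0, 1.0])    # base-forecast error covariance (assumed known)

y_hat = np.array([10.0, 6.5, 4.5])          # incoherent: 6.5 + 4.5 != 10
Winv = np.linalg.inv(W)
P = np.linalg.solve(S.T @ Winv @ S, S.T @ Winv)
y_tilde = S @ (P @ y_hat)                   # reconciled forecasts
print("reconciled:", y_tilde, " coherence check:", y_tilde[1] + y_tilde[2])
```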
We make this solution scalable by providing a computationally efficient representation. We evaluate the performance of the proposed method compared to alternative methods using a series of simulation designs which take into account various features of the collected time series. This is followed by an empirical application using Australian domestic tourism data. The results indicate that the proposed method works well with artificial and real data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 804-819 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1448825 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1448825 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:804-819 Template-Type: ReDIF-Article 1.0 Author-Name: Simon N. Vandekar Author-X-Name-First: Simon N. Author-X-Name-Last: Vandekar Author-Name: Philip T. Reiss Author-X-Name-First: Philip T. Author-X-Name-Last: Reiss Author-Name: Russell T. Shinohara Author-X-Name-First: Russell T. Author-X-Name-Last: Shinohara Title: Interpretable High-Dimensional Inference Via Score Projection With an Application in Neuroimaging Abstract: In the fields of neuroimaging and genetics, a key goal is testing the association of a single outcome with a very high-dimensional imaging or genetic variable. Often, summary measures of the high-dimensional variable are created to sequentially test and localize the association with the outcome. In some cases, the associations between the outcome and summary measures are significant, but subsequent tests used to localize differences are underpowered and do not identify regions associated with the outcome. Here, we propose a generalization of Rao’s score test based on projecting the score statistic onto a linear subspace of a high-dimensional parameter space. The approach provides a way to localize signal in the high-dimensional space by projecting the scores to the subspace where the score test was performed. This allows for inference in the high-dimensional space to be performed on the same degrees of freedom as the score test, effectively reducing the number of comparisons. Simulation results demonstrate that the test has competitive power relative to others commonly used. We illustrate the method by analyzing a subset of the Alzheimer’s Disease Neuroimaging Initiative dataset. Results suggest that cortical thinning of the frontal and temporal lobes may be a useful biological marker of Alzheimer’s disease risk. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 820-830 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1448826 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1448826 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
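A generic sketch of the projection idea in the Vandekar, Reiss, and Shinohara abstract above, in the simplest Gaussian-mean setting with known unit variance (an assumption of the sketch): project the length-p score onto a q-dimensional subspace and refer the quadratic form to a chi-squared distribution with q degrees of freedom.

```python
# Projected score test: instead of a p-dimensional score, test along q
# subspace directions; under H0 the statistic is chi-squared with q df.
import numpy as np
from scipy import stats

rng = np.random.default_rng(12)
n, p, q = 100, 1000, 5
mu = np.zeros(p); mu[:200] = 0.05               # weak signal in the first block
X = rng.normal(size=(n, p)) + mu

B = np.zeros((p, q))                            # q block-average directions
for j in range(q):
    B[j * (p // q):(j + 1) * (p // q), j] = 1.0
B /= np.linalg.norm(B, axis=0)

score = n * X.mean(axis=0)                      # score for H0: mu = 0, unit variance
T = (B.T @ score) @ np.linalg.solve(n * B.T @ B, B.T @ score)
print("projected score statistic:", round(T, 1), " p-value:", stats.chi2.sf(T, df=q))
# Localization then inspects the q standardized components, not p coordinates.
```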
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:820-830 Template-Type: ReDIF-Article 1.0 Author-Name: Matias Quiroz Author-X-Name-First: Matias Author-X-Name-Last: Quiroz Author-Name: Robert Kohn Author-X-Name-First: Robert Author-X-Name-Last: Kohn Author-Name: Mattias Villani Author-X-Name-First: Mattias Author-X-Name-Last: Villani Author-Name: Minh-Ngoc Tran Author-X-Name-First: Minh-Ngoc Author-X-Name-Last: Tran Title: Speeding Up MCMC by Efficient Data Subsampling Abstract: We propose subsampling Markov chain Monte Carlo (MCMC), an MCMC framework where the likelihood function for n observations is estimated from a random subset of m observations. We introduce a highly efficient unbiased estimator of the log-likelihood based on control variates, such that the computing cost is much smaller than that of the full log-likelihood in standard MCMC. The likelihood estimate is bias-corrected and used in two dependent pseudo-marginal algorithms to sample from a perturbed posterior, for which we derive the asymptotic error with respect to n and m, respectively. We propose a practical estimator of the error and show that the error is negligible even for a very small m in our applications. We demonstrate that subsampling MCMC is substantially more efficient than standard MCMC in terms of sampling efficiency for a given computational budget, and that it outperforms other subsampling methods for MCMC proposed in the literature. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 831-843 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1448827 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1448827 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:831-843 Template-Type: ReDIF-Article 1.0 Author-Name: Simon Mak Author-X-Name-First: Simon Author-X-Name-Last: Mak Author-Name: C. F. Jeff Wu Author-X-Name-First: C. F. Jeff Author-X-Name-Last: Wu Title: cmenet: A New Method for Bi-Level Variable Selection of Conditional Main Effects Abstract: This article introduces a novel method for selecting main effects and a set of reparameterized effects called conditional main effects (CMEs), which capture the conditional effect of a factor at a fixed level of another factor. CMEs represent interpretable, domain-specific phenomena for a wide range of applications in engineering, social sciences, and genomics. The key challenge is in incorporating the implicit grouped structure of CMEs within the variable selection procedure itself. We propose a new method, cmenet, which employs two principles called CME coupling and CME reduction to effectively navigate the selection algorithm. Simulation studies demonstrate the improved CME selection performance of cmenet over more generic selection methods. Applied to a gene association study on fly wing shape, cmenet not only yields more parsimonious models and improved predictive performance over standard two-factor interaction analysis methods, but also reveals important insights on gene activation behavior, which can be used to guide further experiments. Efficient implementations of our algorithms are available in the R package cmenet in CRAN. Supplementary materials for this article are available online. 
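The control-variate estimator at the heart of the Quiroz, Kohn, Villani, and Tran abstract above can be sketched as a difference estimator: Taylor control variates q_i are summed exactly over all n observations, and only the residuals l_i - q_i are estimated from the subsample. In the Gaussian toy model below the quadratic expansion happens to be exact, so the residuals vanish; in general they are merely small near the reference point. The model and subsample size are assumptions of the sketch.

```python
# Unbiased subsampled log-likelihood with second-order Taylor control
# variates around a reference point theta_star.
import numpy as np

rng = np.random.default_rng(13)
n, m = 100_000, 500
y = rng.normal(loc=1.0, size=n)

def loglik(theta):                      # per-observation N(theta, 1) log-density
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y - theta) ** 2

theta_star = y.mean()                   # reference point
l_star, g_star, h_star = loglik(theta_star), y - theta_star, -1.0

def estimate(theta):
    d = theta - theta_star
    q = l_star + g_star * d + 0.5 * h_star * d ** 2    # control variates
    idx = rng.integers(0, n, m)                        # subsample of size m
    return q.sum() + n * np.mean(loglik(theta)[idx] - q[idx])

theta = 1.02
print("exact log-likelihood:", loglik(theta).sum())
print("subsampled estimate :", estimate(theta))  # exact here: quadratic model
```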
Journal: Journal of the American Statistical Association Pages: 844-856 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1448828 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1448828 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:844-856 Template-Type: ReDIF-Article 1.0 Author-Name: Ting Yan Author-X-Name-First: Ting Author-X-Name-Last: Yan Author-Name: Binyan Jiang Author-X-Name-First: Binyan Author-X-Name-Last: Jiang Author-Name: Stephen E. Fienberg Author-X-Name-First: Stephen E. Author-X-Name-Last: Fienberg Author-Name: Chenlei Leng Author-X-Name-First: Chenlei Author-X-Name-Last: Leng Title: Statistical Inference in a Directed Network Model With Covariates Abstract: Networks are often characterized by node heterogeneity, whereby nodes exhibit different degrees of interaction, and link homophily, whereby nodes sharing common features tend to associate with each other. In this article, we rigorously study a directed network model that captures the former via node-specific parameterization and the latter by incorporating covariates. In particular, this model quantifies the extent of heterogeneity in terms of outgoingness and incomingness of each node by different parameters, thus allowing the number of heterogeneity parameters to be twice the number of nodes. We study the maximum likelihood estimation of the model and establish the uniform consistency and asymptotic normality of the resulting estimators. Numerical studies demonstrate our theoretical findings and two data analyses confirm the usefulness of our model. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 857-868 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1448829 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1448829 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:857-868 Template-Type: ReDIF-Article 1.0 Author-Name: Likai Chen Author-X-Name-First: Likai Author-X-Name-Last: Chen Author-Name: Wei Biao Wu Author-X-Name-First: Wei Biao Author-X-Name-Last: Wu Title: Testing for Trends in High-Dimensional Time Series Abstract: The article considers statistical inference for trends of high-dimensional time series. Based on a modified $\mathcal{L}^2$ distance between parametric and nonparametric trend estimators, we propose a de-diagonalized quadratic form test statistic for testing patterns on trends, such as linear, quadratic, or parallel forms. We develop an asymptotic theory for the test statistic. A Gaussian multiplier testing procedure is proposed, and it has improved finite-sample performance. Our testing procedure is applied to spatio-temporal temperature data gathered from various locations across America. A simulation study is also presented to illustrate the performance of our testing method. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 869-881 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1456935 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1456935 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
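A generic Gaussian multiplier procedure of the kind referenced in the Chen and Wu abstract above can be sketched as follows. The quadratic-form statistic and the per-observation contributions are placeholders for the paper's de-diagonalized construction, not a reproduction of it.

```python
import numpy as np

def multiplier_bootstrap_pvalue(contrib, observed, n_boot=2000, seed=0):
    """Gaussian multiplier bootstrap for a quadratic-form statistic.

    contrib: (n, p) array of centered per-observation contributions to
    the test statistic; observed: the statistic computed from the data.
    Each bootstrap replicate perturbs the contributions with iid N(0,1)
    multipliers and recomputes the statistic.
    """
    rng = np.random.default_rng(seed)
    n, p = contrib.shape
    boot = np.empty(n_boot)
    for b in range(n_boot):
        xi = rng.standard_normal(n)                 # iid multipliers
        perturbed = contrib.T @ xi / np.sqrt(n)     # p-vector
        boot[b] = np.sum(perturbed ** 2)            # quadratic form
    return np.mean(boot >= observed)                # bootstrap p-value
```

The multiplier scheme avoids re-estimating the trend in each replicate, which is what makes it attractive in high dimensions.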
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:869-881 Template-Type: ReDIF-Article 1.0 Author-Name: Rong Zhu Author-X-Name-First: Rong Author-X-Name-Last: Zhu Author-Name: Alan T. K. Wan Author-X-Name-First: Alan T. K. Author-X-Name-Last: Wan Author-Name: Xinyu Zhang Author-X-Name-First: Xinyu Author-X-Name-Last: Zhang Author-Name: Guohua Zou Author-X-Name-First: Guohua Author-X-Name-Last: Zou Title: A Mallows-Type Model Averaging Estimator for the Varying-Coefficient Partially Linear Model Abstract: In the last decade, significant theoretical advances have been made in the area of frequentist model averaging (FMA); however, the majority of this work has emphasized parametric model setups. This article considers FMA for the semiparametric varying-coefficient partially linear model (VCPLM), which has gained prominence as an extensively used modeling tool in recent years. Within this context, we develop a Mallows-type criterion for assigning model weights and prove its asymptotic optimality. A simulation study and a real data analysis demonstrate that the FMA estimator that arises from this criterion is vastly preferred to information criterion score-based model selection and averaging estimators. Our analysis is complicated by the fact that the VCPLM is subject to uncertainty arising not only from the choice of covariates, but also whether the covariate should enter the parametric or nonparametric parts of the model. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 882-892 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1456936 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1456936 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:882-892 Template-Type: ReDIF-Article 1.0 Author-Name: Junxian Geng Author-X-Name-First: Junxian Author-X-Name-Last: Geng Author-Name: Anirban Bhattacharya Author-X-Name-First: Anirban Author-X-Name-Last: Bhattacharya Author-Name: Debdeep Pati Author-X-Name-First: Debdeep Author-X-Name-Last: Pati Title: Probabilistic Community Detection With Unknown Number of Communities Abstract: A fundamental problem in network analysis is clustering the nodes into groups that share a similar connectivity pattern. Existing algorithms for community detection assume the knowledge of the number of clusters or estimate it a priori using various selection criteria and subsequently estimate the community structure. Ignoring the uncertainty in the first stage may lead to erroneous clustering, particularly when the community structure is vague. We instead propose a coherent probabilistic framework for simultaneous estimation of the number of communities and the community structure, adapting recently developed Bayesian nonparametric techniques to network models. An efficient Markov chain Monte Carlo (MCMC) algorithm is proposed that obviates the need to perform reversible jump MCMC on the number of clusters. The methodology is shown to outperform recently developed community detection algorithms in a variety of synthetic data examples and in benchmark real datasets. Using an appropriate metric on the space of all configurations, we develop nonasymptotic Bayes risk bounds even when the number of clusters is unknown. En route, we develop concentration properties of nonlinear functions of Bernoulli random variables, which may be of independent interest in analysis of related models.
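As a side note on the Mallows-type weighting in the Zhu, Wan, Zhang, and Zou entry above: in its simplest least-squares form, the criterion penalizes the averaged residual sum of squares by an effective-degrees-of-freedom term and is minimized over the weight simplex. The sketch below treats that generic case, with candidate fitted values, their degrees of freedom, and a noise variance supplied by the user; it is not the VCPLM criterion of the paper.

```python
import numpy as np
from scipy.optimize import minimize

def mallows_weights(y, fits, dofs, sigma2):
    """Mallows-type model-averaging weights: minimize squared error plus
    a degrees-of-freedom penalty over the probability simplex.

    y: (n,) response; fits: list of (n,) candidate fitted-value vectors;
    dofs: effective degrees of freedom of each candidate; sigma2: an
    estimate of the noise variance.
    """
    M = len(fits)
    F = np.column_stack(fits)              # n x M fitted values
    k = np.asarray(dofs, dtype=float)

    def crit(w):
        resid = y - F @ w
        return resid @ resid + 2.0 * sigma2 * (k @ w)

    cons = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},)
    bnds = [(0.0, 1.0)] * M
    res = minimize(crit, np.full(M, 1.0 / M), bounds=bnds, constraints=cons)
    return res.x
```

The simplex constraint is what distinguishes averaging from selection: a degenerate optimum putting all weight on one model recovers ordinary Mallows selection.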
Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 893-905 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1458618 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1458618 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:893-905 Template-Type: ReDIF-Article 1.0 Author-Name: Éric Lesage Author-X-Name-First: Éric Author-X-Name-Last: Lesage Author-Name: David Haziza Author-X-Name-First: David Author-X-Name-Last: Haziza Author-Name: Xavier D’Haultfœuille Author-X-Name-First: Xavier Author-X-Name-Last: D’Haultfœuille Title: A Cautionary Tale on Instrumental Calibration for the Treatment of Nonignorable Unit Nonresponse in Surveys Abstract: Response rates have been steadily declining over the last few decades, making survey estimates vulnerable to nonresponse bias. To reduce the potential bias, two weighting approaches are commonly used in National Statistical Offices: the one-step and the two-step approaches. In this article, we focus on the one-step approach, whereby the design weights are modified in a single step with two simultaneous goals in mind: reduce the nonresponse bias and ensure the consistency between survey estimates and known population totals. In particular, we examine the properties of instrumental calibration, a special case of the one-step approach that has received a lot of attention in the literature in recent years. Despite the rich literature on the topic, there remain some important gaps that this article aims to fill. First, we give a set of sufficient conditions required for establishing the consistency of instrumental calibration estimators. Second, we show that the latter may suffer from a large bias when some of these conditions are violated. Results from a simulation study support our findings. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 906-915 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1458619 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1458619 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:906-915 Template-Type: ReDIF-Article 1.0 Author-Name: Rongmao Zhang Author-X-Name-First: Rongmao Author-X-Name-Last: Zhang Author-Name: Peter Robinson Author-X-Name-First: Peter Author-X-Name-Last: Robinson Author-Name: Qiwei Yao Author-X-Name-First: Qiwei Author-X-Name-Last: Yao Title: Identifying Cointegration by Eigenanalysis Abstract: We propose a new and easy-to-use method for identifying cointegrated components of nonstationary time series, consisting of an eigenanalysis of a certain nonnegative definite matrix. Our setting is model-free, and we allow the integer-valued integration orders of the observable series to be unknown, and to possibly differ. Consistency of estimates of the cointegration space and cointegration rank is established both when the dimension of the observable time series is fixed as sample size increases, and when it diverges slowly. The proposed methodology is also extended and justified in a fractional setting. A Monte Carlo study of finite-sample performance, and a small empirical illustration, are reported. Supplementary materials for this article are available online.
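The eigenanalysis at the heart of the Zhang, Robinson, and Yao abstract above can be illustrated schematically: accumulate outer products of sample autocovariance matrices into a nonnegative definite matrix and read candidate cointegrating directions off its eigenvectors. The lag cap k0 and the convention that the smallest-eigenvalue directions are the stationary candidates are illustrative choices, not the paper's precise recipe.

```python
import numpy as np

def cointegration_eigen(y, k0=5):
    """Eigenanalysis for candidate cointegrating directions.

    y: (T, p) observed (possibly nonstationary) series. Builds
    W = sum_{k=0..k0} S_k S_k' from lag-k sample autocovariances S_k,
    so W is nonnegative definite, then eigendecomposes it. Directions
    with the smallest eigenvalues are the stationary candidates, since
    nonstationary directions dominate the autocovariances.
    """
    T, p = y.shape
    yc = y - y.mean(axis=0)
    W = np.zeros((p, p))
    for k in range(k0 + 1):
        S = yc[k:].T @ yc[:T - k] / T     # lag-k autocovariance
        W += S @ S.T
    eigval, eigvec = np.linalg.eigh(W)    # ascending eigenvalues
    return eigval, eigvec
```

Choosing how many small-eigenvalue directions to keep is the cointegration-rank problem the paper addresses with formal consistency results.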
Journal: Journal of the American Statistical Association Pages: 916-927 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1458620 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1458620 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:916-927 Template-Type: ReDIF-Article 1.0 Author-Name: Wenliang Pan Author-X-Name-First: Wenliang Author-X-Name-Last: Pan Author-Name: Xueqin Wang Author-X-Name-First: Xueqin Author-X-Name-Last: Wang Author-Name: Weinan Xiao Author-X-Name-First: Weinan Author-X-Name-Last: Xiao Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Title: A Generic Sure Independence Screening Procedure Abstract: Extracting important features from ultra-high dimensional data is one of the primary tasks in statistical learning, information theory, precision medicine, and biological discovery. Many of the sure independence screening methods developed to meet these needs are suitable for special models under some assumptions. With the availability of more data types and possible models, a model-free generic screening procedure with fewer and less restrictive assumptions is desirable. In this article, we propose a generic nonparametric sure independence screening procedure, called BCor-SIS, on the basis of a recently developed universal dependence measure: Ball correlation. We show that the proposed procedure has strong screening consistency even when the dimensionality is of exponential order in the sample size, without imposing sub-exponential moment assumptions on the data. We investigate the flexibility of this procedure by considering three commonly encountered challenging settings in biological discovery or precision medicine: iterative BCor-SIS, interaction pursuit, and survival outcomes. We use simulation studies and real data analyses to illustrate the versatility and practicability of our BCor-SIS method. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 928-937 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1462709 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1462709 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:928-937 Template-Type: ReDIF-Article 1.0 Author-Name: Jessica G. Young Author-X-Name-First: Jessica G. Author-X-Name-Last: Young Author-Name: Roger W. Logan Author-X-Name-First: Roger W. Author-X-Name-Last: Logan Author-Name: James M. Robins Author-X-Name-First: James M. Author-X-Name-Last: Robins Author-Name: Miguel A. Hernán Author-X-Name-First: Miguel A. Author-X-Name-Last: Hernán Title: Inverse Probability Weighted Estimation of Risk Under Representative Interventions in Observational Studies Abstract: Researchers are often interested in using observational data to estimate the effect on a health outcome of maintaining a continuous treatment within a prespecified range over time, for example, “always exercise at least 30 minutes per day.” There may be many precise interventions that could achieve this range. In this article, we consider representative interventions. These are special cases of random dynamic interventions: interventions under which treatment at each time is assigned according to a random draw from a distribution that may depend on a subject’s measured past.
Estimators of risk under representative interventions on a time-varying treatment have previously been described based on g-estimation of structural nested cumulative failure time models. In this article, we consider an alternative approach based on inverse probability weighting (IPW) of marginal structural models. In particular, we show that the risk under a representative intervention on a time-varying continuous treatment can be consistently estimated via computationally simple IPW methods traditionally used for deterministic static (i.e., “nonrandom” and “nondynamic”) interventions for binary treatments. We present an application of IPW in this setting to estimate the 28-year risk of coronary heart disease under various representative interventions on lifestyle behaviors in the Nurses' Health Study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 938-947 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1469993 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1469993 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:938-947 Template-Type: ReDIF-Article 1.0 Author-Name: Wonyul Lee Author-X-Name-First: Wonyul Author-X-Name-Last: Lee Author-Name: Michelle F. Miranda Author-X-Name-First: Michelle F. Author-X-Name-Last: Miranda Author-Name: Philip Rausch Author-X-Name-First: Philip Author-X-Name-Last: Rausch Author-Name: Veerabhadran Baladandayuthapani Author-X-Name-First: Veerabhadran Author-X-Name-Last: Baladandayuthapani Author-Name: Massimo Fazio Author-X-Name-First: Massimo Author-X-Name-Last: Fazio Author-Name: J. Crawford Downs Author-X-Name-First: J. Crawford Author-X-Name-Last: Downs Author-Name: Jeffrey S. Morris Author-X-Name-First: Jeffrey S. Author-X-Name-Last: Morris Title: Bayesian Semiparametric Functional Mixed Models for Serially Correlated Functional Data, With Application to Glaucoma Data Abstract: Glaucoma, a leading cause of blindness, is characterized by optic nerve damage related to intraocular pressure (IOP), but its full etiology is unknown. Researchers at UAB have devised a custom device to measure scleral strain continuously around the eye under fixed levels of IOP, which here is used to assess how strain varies around the posterior pole, with IOP, and across glaucoma risk factors such as age. The hypothesis is that scleral strain decreases with age, which could alter biomechanics of the optic nerve head and cause damage that could eventually lead to glaucoma. To evaluate this hypothesis, we adapted Bayesian Functional Mixed Models to model these complex data consisting of correlated functions on a spherical scleral surface, with nonparametric age effects allowed to vary in magnitude and smoothness across the scleral surface, multi-level random effect functions to capture within-subject correlation, and functional growth curve terms to capture serial correlation across IOPs that can vary around the scleral surface. Our method yields fully Bayesian inference on the scleral surface or any aggregation or transformation thereof, and reveals interesting insights into the biomechanical etiology of glaucoma. The general modeling framework described is very flexible and applicable to many complex, high-dimensional functional data. Supplementary materials for this article are available online.
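Returning to the Young, Logan, Robins, and Hernán abstract above: the computational core of IPW for a sustained intervention is a product of inverse time-specific treatment probabilities, applied to subjects whose observed history is compatible with the intervention. The toy estimator below ignores censoring and the time-varying confounder modeling details, and is only a schematic of the weighting step.

```python
import numpy as np

def ipw_risk(treat_prob, followed, event):
    """Toy inverse-probability-weighted risk under an intervention.

    treat_prob: (n, T) estimated probabilities of the treatment actually
    received at each time, given measured past; followed: (n,) indicator
    that a subject's treatment history is compatible with the
    intervention through follow-up; event: (n,) outcome indicator.
    """
    denom = np.clip(treat_prob.prod(axis=1), 1e-12, None)
    w = followed / denom                   # stabilization omitted
    return np.sum(w * event) / np.sum(w)   # weighted (Hajek) risk
```

In practice the weights are stabilized and truncated, and the probabilities in the denominator come from fitted treatment models; none of that bookkeeping is shown here.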
Journal: Journal of the American Statistical Association Pages: 495-513 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1476242 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1476242 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:495-513 Template-Type: ReDIF-Article 1.0 Author-Name: Oscar Hernan Madrid Padilla Author-X-Name-First: Oscar Hernan Author-X-Name-Last: Madrid Padilla Author-Name: Alex Athey Author-X-Name-First: Alex Author-X-Name-Last: Athey Author-Name: Alex Reinhart Author-X-Name-First: Alex Author-X-Name-Last: Reinhart Author-Name: James G. Scott Author-X-Name-First: James G. Author-X-Name-Last: Scott Title: Sequential Nonparametric Tests for a Change in Distribution: An Application to Detecting Radiological Anomalies Abstract: We propose a sequential nonparametric test for detecting a change in distribution, based on windowed Kolmogorov–Smirnov statistics. The approach is simple, robust, highly computationally efficient, easy to calibrate, and requires no parametric assumptions about the underlying null and alternative distributions. We show that both the false-alarm rate and the power of our procedure are amenable to rigorous analysis, and that the method outperforms existing sequential testing procedures in practice. We then apply the method to the problem of detecting radiological anomalies, using data collected from measurements of the background gamma-radiation spectrum on a large university campus. In this context, the proposed method leads to substantial improvements in time-to-detection for the kind of radiological anomalies of interest in law-enforcement and border-security applications. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 514-528 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1476245 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1476245 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:514-528 Template-Type: ReDIF-Article 1.0 Author-Name: Naoki Egami Author-X-Name-First: Naoki Author-X-Name-Last: Egami Author-Name: Kosuke Imai Author-X-Name-First: Kosuke Author-X-Name-Last: Imai Title: Causal Interaction in Factorial Experiments: Application to Conjoint Analysis Abstract: We study causal interaction in factorial experiments, in which several factors, each with multiple levels, are randomized to form a large number of possible treatment combinations. Examples of such experiments include conjoint analysis, which is often used by social scientists to analyze multidimensional preferences in a population. To characterize the structure of causal interaction in factorial experiments, we propose a new causal interaction effect, called the average marginal interaction effect (AMIE). Unlike the conventional interaction effect, the relative magnitude of the AMIE does not depend on the choice of baseline conditions, making its interpretation intuitive even for higher-order interactions. We show that the AMIE can be nonparametrically estimated using ANOVA regression with weighted zero-sum constraints.
Because the AMIEs are invariant to the choice of baseline conditions, we directly regularize them by collapsing levels and selecting factors within a penalized ANOVA framework. This regularized estimation procedure reduces the false discovery rate and further facilitates interpretation. Finally, we apply the proposed methodology to the conjoint analysis of ethnic voting behavior in Africa and find clear patterns of causal interaction between politicians’ ethnicity and their prior records. The proposed methodology is implemented in an open source software package. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 529-540 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1476246 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1476246 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:529-540 Template-Type: ReDIF-Article 1.0 Author-Name: Seung Jun Shin Author-X-Name-First: Seung Jun Author-X-Name-Last: Shin Author-Name: Ying Yuan Author-X-Name-First: Ying Author-X-Name-Last: Yuan Author-Name: Louise C. Strong Author-X-Name-First: Louise C. Author-X-Name-Last: Strong Author-Name: Jasmina Bojadzieva Author-X-Name-First: Jasmina Author-X-Name-Last: Bojadzieva Author-Name: Wenyi Wang Author-X-Name-First: Wenyi Author-X-Name-Last: Wang Title: Bayesian Semiparametric Estimation of Cancer-Specific Age-at-Onset Penetrance With Application to Li-Fraumeni Syndrome Abstract: Penetrance, which plays a key role in genetic research, is defined as the proportion of individuals with the genetic variants (i.e., genotype) that cause a particular trait and who have clinical symptoms of the trait (i.e., phenotype). We propose a Bayesian semiparametric approach to estimate the cancer-specific age-at-onset penetrance in the presence of the competing risk of multiple cancers. We employ a Bayesian semiparametric competing risk model to model the duration until individuals in a high-risk group develop different cancers, and accommodate family data using family-wise likelihoods. We tackle the ascertainment bias arising when family data are collected through probands in a high-risk population in which disease cases are more likely to be observed. We apply the proposed method to a cohort of 186 families with Li-Fraumeni syndrome identified through probands with sarcoma treated at MD Anderson Cancer Center from 1944 to 1982. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 541-552 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1482749 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1482749 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
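The windowed Kolmogorov–Smirnov idea in the Madrid Padilla, Athey, Reinhart, and Scott entry above reduces to a short loop: maintain a sliding window of recent observations and compare it against a reference sample. The fixed per-test threshold used below sidesteps the sequential calibration of false-alarm rates that the paper actually develops.

```python
from collections import deque
import numpy as np
from scipy.stats import ks_2samp

def windowed_ks_monitor(stream, ref, window=200, alpha=0.05):
    """Flag the first time a sliding window of recent observations looks
    inconsistent with the reference sample via a two-sample KS test.

    `alpha` is a naive per-test threshold; calibrating the sequential
    false-alarm rate is the hard part and is not shown here.
    """
    buf = deque(maxlen=window)
    for t, x in enumerate(stream):
        buf.append(x)
        if len(buf) == window:
            stat, p = ks_2samp(ref, np.fromiter(buf, dtype=float))
            if p < alpha:
                return t, stat    # alarm time and KS statistic
    return None                   # no change detected
```

Because the KS statistic is distribution-free, the same monitor applies unchanged to gamma-radiation spectra, sensor streams, or any scalar measurement series.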
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:541-552 Template-Type: ReDIF-Article 1.0 Author-Name: Lei Huang Author-X-Name-First: Lei Author-X-Name-Last: Huang Author-Name: Jiawei Bai Author-X-Name-First: Jiawei Author-X-Name-Last: Bai Author-Name: Andrada Ivanescu Author-X-Name-First: Andrada Author-X-Name-Last: Ivanescu Author-Name: Tamara Harris Author-X-Name-First: Tamara Author-X-Name-Last: Harris Author-Name: Mathew Maurer Author-X-Name-First: Mathew Author-X-Name-Last: Maurer Author-Name: Philip Green Author-X-Name-First: Philip Author-X-Name-Last: Green Author-Name: Vadim Zipunnikov Author-X-Name-First: Vadim Author-X-Name-Last: Zipunnikov Title: Multilevel Matrix-Variate Analysis and its Application to Accelerometry-Measured Physical Activity in Clinical Populations Abstract: The number of studies where the primary measurement is a matrix is exploding. In response to this, we propose a statistical framework for modeling populations of repeatedly observed matrix-variate measurements. The 2D structure is handled via a matrix-variate distribution with decomposable row/column-specific covariance matrices, and a linear mixed effect framework is used to model the multilevel design. The proposed framework flexibly expands to accommodate many common crossed and nested designs and introduces two important concepts: the between-subject distance and intraclass correlation coefficient, both defined for matrix-variate data. The computational feasibility and performance of the approach are shown in extensive simulation studies. The method is motivated by and applied to a study that monitored physical activity of individuals diagnosed with congestive heart failure (CHF) over a 4- to 9-month period. The long-term patterns of physical activity are studied and compared in two CHF subgroups: with and without adverse clinical events. Supplementary materials for this article, which include de-identified accelerometry and clinical data, are available online. Journal: Journal of the American Statistical Association Pages: 553-564 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1482750 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1482750 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:553-564 Template-Type: ReDIF-Article 1.0 Author-Name: Domenico Giannone Author-X-Name-First: Domenico Author-X-Name-Last: Giannone Author-Name: Michele Lenza Author-X-Name-First: Michele Author-X-Name-Last: Lenza Author-Name: Giorgio E. Primiceri Author-X-Name-First: Giorgio E. Author-X-Name-Last: Primiceri Title: Priors for the Long Run Abstract: We propose a class of prior distributions that discipline the long-run behavior of vector autoregressions (VARs). These priors can be naturally elicited using economic theory, which provides guidance on the joint dynamics of macroeconomic time series in the long run. Our priors for the long run are conjugate, and can thus be easily implemented using dummy observations and combined with other popular priors. In VARs with standard macroeconomic variables, a prior based on the long-run predictions of a wide class of theoretical models yields substantial improvements in the forecasting performance. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
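The dummy-observation mechanics mentioned in the Giannone, Lenza, and Primiceri abstract above are generic for conjugate VAR priors: stack artificial observations encoding the prior on top of the data and run least squares on the augmented system. The sketch below shows only that mechanic; constructing the long-run dummies themselves from economic theory is the paper's contribution and is not reproduced here.

```python
import numpy as np

def var_posterior_mean(Y, X, Y_dum, X_dum):
    """Posterior mean of VAR coefficients under a conjugate prior encoded
    as dummy observations: OLS on the dummy-augmented system.

    Y (T x n): left-hand-side variables; X (T x k): lagged regressors
    plus deterministic terms; Y_dum, X_dum: artificial observations
    implementing the prior.
    """
    Ya = np.vstack([Y_dum, Y])
    Xa = np.vstack([X_dum, X])
    B, *_ = np.linalg.lstsq(Xa, Ya, rcond=None)
    return B   # k x n matrix of posterior-mean coefficients
```

Because the prior enters as extra rows, it composes transparently with other dummy-observation priors such as the Minnesota prior: simply stack more rows.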
Journal: Journal of the American Statistical Association Pages: 565-580 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1483826 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1483826 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:565-580 Template-Type: ReDIF-Article 1.0 Author-Name: Xiangyu Luo Author-X-Name-First: Xiangyu Author-X-Name-Last: Luo Author-Name: Yingying Wei Author-X-Name-First: Yingying Author-X-Name-Last: Wei Title: Batch Effects Correction with Unknown Subtypes Abstract: High-throughput experimental data are accumulating exponentially in public databases. Unfortunately, mining valid scientific discoveries from these abundant resources is hampered by technical artifacts and inherent biological heterogeneity. The former are usually termed “batch effects,” and the latter is often modeled by subtypes. Existing methods either tackle batch effects provided that subtypes are known or cluster subtypes assuming that batch effects are absent. Consequently, there is a lack of research on the correction of batch effects with the presence of unknown subtypes. Here, we combine a location-and-scale adjustment model and model-based clustering into a novel hybrid one, the batch-effects-correction-with-unknown-subtypes model (BUS). BUS is capable of (a) correcting batch effects explicitly, (b) grouping samples that share similar characteristics into subtypes, (c) identifying features that distinguish subtypes, (d) allowing the number of subtypes to vary from batch to batch, (e) integrating batches from different platforms, and (f) enjoying linear-order computational complexity. We prove the identifiability of BUS and provide conditions for study designs under which batch effects can be corrected. BUS is evaluated by simulation studies and a real breast cancer dataset combined from three batches measured on two platforms. Results from the breast cancer dataset offer much better biological insights than existing methods. We implement BUS as a free Bioconductor package, BUScorrect. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 581-594 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1497494 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1497494 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:581-594 Template-Type: ReDIF-Article 1.0 Author-Name: Yakuan Chen Author-X-Name-First: Yakuan Author-X-Name-Last: Chen Author-Name: Jeff Goldsmith Author-X-Name-First: Jeff Author-X-Name-Last: Goldsmith Author-Name: R. Todd Ogden Author-X-Name-First: R. Todd Author-X-Name-Last: Ogden Title: Functional Data Analysis of Dynamic PET Data Abstract: One application of positron emission tomography (PET), a nuclear imaging technique, in neuroscience involves in vivo estimation of the density of various proteins (often, neuroreceptors) in the brain. PET scanning begins with the injection of a radiolabeled tracer that binds preferentially to the target protein; tracer molecules are then continuously delivered to the brain via the bloodstream. By detecting the radioactive decay of the tracer over time, dynamic PET data are constructed to reflect the concentration of the target protein in the brain at each time.
The fundamental problem in the analysis of dynamic PET data involves estimating the impulse response function (IRF), which is necessary for describing the binding behavior of the injected radiotracer. Virtually all existing methods have three common aspects: summarizing the entire IRF with a single scalar measure; modeling each subject separately; and imposing parametric restrictions on the IRF. In contrast, we propose a functional data analytic approach that regards each subject’s IRF as the basic analysis unit, models multiple subjects simultaneously, and estimates the IRF nonparametrically. We pose our model as a linear mixed effect model in which population-level fixed effects and subject-specific random effects are expanded using a B-spline basis. Shrinkage and roughness penalties are incorporated in the model to enforce identifiability and smoothness of the estimated curves, respectively, while monotonicity and nonnegativity constraints impose biological information on estimates. We illustrate this approach by applying it to clinical PET data with subjects belonging to three diagnostic groups. We explore differences among groups by means of pointwise confidence intervals of the estimated mean curves based on bootstrap samples. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 595-609 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1497495 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1497495 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:595-609 Template-Type: ReDIF-Article 1.0 Author-Name: Will Landau Author-X-Name-First: Will Author-X-Name-Last: Landau Author-Name: Jarad Niemi Author-X-Name-First: Jarad Author-X-Name-Last: Niemi Author-Name: Dan Nettleton Author-X-Name-First: Dan Author-X-Name-Last: Nettleton Title: Fully Bayesian Analysis of RNA-seq Counts for the Detection of Gene Expression Heterosis Abstract: Heterosis, or hybrid vigor, is the enhancement of the phenotype of hybrid progeny relative to their inbred parents. Heterosis is extensively used in agriculture, but the underlying mechanisms are unclear. To investigate the molecular basis of phenotypic heterosis, researchers search tens of thousands of genes for heterosis with respect to expression in the transcriptome. Difficulty arises in the assessment of heterosis due to composite null hypotheses and nonuniform distributions for p-values under these null hypotheses. Thus, we develop a general hierarchical model for count data and a fully Bayesian analysis in which an efficient parallelized Markov chain Monte Carlo algorithm ameliorates the computational burden. We use our method to detect gene expression heterosis in a two-hybrid plant-breeding scenario, both in a real RNA-seq maize dataset and in simulation studies. In the simulation studies, we show our method has well-calibrated posterior probabilities and credible intervals when the model assumed in analysis matches the model used to simulate the data. Although model misspecification can adversely affect calibration, the methodology is still able to accurately rank genes.
Finally, we show that hyperparameter posteriors are extremely narrow and an empirical Bayes (eBayes) approach based on posterior means from the fully Bayesian analysis provides virtually equivalent posterior probabilities, credible intervals, and gene rankings relative to the fully Bayesian solution. This evidence of equivalence provides support for the use of eBayes procedures in RNA-seq data analysis if accurate hyperparameter estimates can be obtained. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 610-621 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1497496 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1497496 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:610-621 Template-Type: ReDIF-Article 1.0 Author-Name: Yifei Wang Author-X-Name-First: Yifei Author-X-Name-Last: Wang Author-Name: Daniel J. Tancredi Author-X-Name-First: Daniel J. Author-X-Name-Last: Tancredi Author-Name: Diana L. Miglioretti Author-X-Name-First: Diana L. Author-X-Name-Last: Miglioretti Title: Joint Indirect Standardization When Only Marginal Distributions are Observed in the Index Population Abstract: It is a common interest in medicine to determine whether a hospital meets a benchmark created from an aggregate reference population, after accounting for differences in distributions of multiple covariates. Due to the difficulties of collecting individual-level data, however, it is often the case that only marginal distributions of the covariates are available, making covariate-adjusted comparison challenging. We propose and evaluate a novel approach for conducting indirect standardization when only marginal covariate distributions of the studied hospital are known, but complete information is available for the reference hospitals. We do this with the aid of two existing methods: iterative proportional fitting, which estimates the cells of a contingency table when only marginal sums are known, and synthetic control methods, which create a counterfactual control group using a weighted combination of potential control groups. The proper application of these existing methods for indirect standardization would require accounting for the statistical uncertainties induced by a situation where no individual-level data are collected from the studied population. We address this need with a novel method that uses a random Dirichlet parameterization of the synthetic control weights to estimate uncertainty intervals for the standardized incidence ratio. We demonstrate our novel methods by estimating hospital-level standardized incidence ratios for comparing the adjusted probability of computed tomography examinations with high radiation doses, relative to a reference standard, and we evaluate our methods in a simulation study. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 622-630 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2018.1506340 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1506340 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
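One building block named in the Wang, Tancredi, and Miglioretti abstract above, iterative proportional fitting, is short enough to state in full: alternately rescale the rows and columns of a seed table until its margins match the targets. The two-way version below is the textbook algorithm; the paper's combination with synthetic-control weighting and Dirichlet-based uncertainty is not shown.

```python
import numpy as np

def ipf(seed, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Iterative proportional fitting for a two-way table.

    Alternately rescales rows then columns of a positive seed table so
    that both margins match the targets (the two target vectors must
    share the same total for the iteration to converge).
    """
    T = np.asarray(seed, dtype=float).copy()
    rt = np.asarray(row_targets, dtype=float)
    ct = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        T *= (rt / T.sum(axis=1))[:, None]   # match row sums
        T *= ct / T.sum(axis=0)              # match column sums
        if np.allclose(T.sum(axis=1), rt, atol=tol):
            break
    return T
```

The seed table encodes the association structure (here, borrowed from the reference hospitals), while the margins come from the index hospital; IPF reconciles the two.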
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:622-630 Template-Type: ReDIF-Article 1.0 Author-Name: Jing Su Author-X-Name-First: Jing Author-X-Name-Last: Su Title: Book Review Journal: Journal of the American Statistical Association Pages: 948-948 Issue: 526 Volume: 114 Year: 2019 Month: 4 X-DOI: 10.1080/01621459.2019.1614762 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1614762 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:948-948 Template-Type: ReDIF-Article 1.0 Author-Name: Clement Lee Author-X-Name-First: Clement Author-X-Name-Last: Lee Author-Name: Darren J. Wilkinson Author-X-Name-First: Darren J. Author-X-Name-Last: Wilkinson Title: A Hierarchical Model of Nonhomogeneous Poisson Processes for Twitter Retweets Abstract: We present a hierarchical model of nonhomogeneous Poisson processes (NHPP) for information diffusion on online social media, in particular Twitter retweets. The retweets of each original tweet are modelled by a NHPP, for which the intensity function is a product of time-decaying components and another component that depends on the follower count of the original tweet author. The latter allows us to explain or predict the ultimate retweet count by a network centrality-related covariate. The inference algorithm enables the Bayes factor to be computed, to facilitate model selection. Finally, the model is applied to the retweet datasets of two hashtags. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1-15 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2019.1585358 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1585358 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:1-15 Template-Type: ReDIF-Article 1.0 Author-Name: Jonathan P. Williams Author-X-Name-First: Jonathan P. Author-X-Name-Last: Williams Author-Name: Curtis B. Storlie Author-X-Name-First: Curtis B. Author-X-Name-Last: Storlie Author-Name: Terry M. Therneau Author-X-Name-First: Terry M. Author-X-Name-Last: Therneau Author-Name: Clifford R. Jack Jr Author-X-Name-First: Clifford R. Author-X-Name-Last: Jack Jr Author-Name: Jan Hannig Author-X-Name-First: Jan Author-X-Name-Last: Hannig Title: A Bayesian Approach to Multistate Hidden Markov Models: Application to Dementia Progression Abstract: People are living longer than ever before, and with this arise new complications and challenges for humanity. Among the most pressing of these challenges is of understanding the role of aging in the development of dementia. This article is motivated by the Mayo Clinic Study of Aging data for 4742 subjects since 2004, and how it can be used to draw inference on the role of aging in the development of dementia. We construct a hidden Markov model (HMM) to represent progression of dementia from states associated with the buildup of amyloid plaque in the brain, and the loss of cortical thickness. A hierarchical Bayesian approach is taken to estimate the parameters of the HMM with a truly time-inhomogeneous infinitesimal generator matrix, and response functions of the continuous-valued biomarker measurements are cut-point agnostic. A Bayesian approach with these features could be useful in many disease progression models.
Additionally, an approach is illustrated for correcting a common bias in delayed enrollment studies, in which some or all subjects are not observed at baseline. Standard software is incapable of accounting for this critical feature, so code to perform the estimation of the model described below is made available online. Code submitted with this article was checked by an Associate Editor for Reproducibility and is available as an online supplement. Journal: Journal of the American Statistical Association Pages: 16-31 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2019.1594831 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1594831 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:16-31 Template-Type: ReDIF-Article 1.0 Author-Name: Curtis B. Storlie Author-X-Name-First: Curtis B. Author-X-Name-Last: Storlie Author-Name: Terry M. Therneau Author-X-Name-First: Terry M. Author-X-Name-Last: Therneau Author-Name: Rickey E. Carter Author-X-Name-First: Rickey E. Author-X-Name-Last: Carter Author-Name: Nicholas Chia Author-X-Name-First: Nicholas Author-X-Name-Last: Chia Author-Name: John R. Bergquist Author-X-Name-First: John R. Author-X-Name-Last: Bergquist Author-Name: Jeanne M. Huddleston Author-X-Name-First: Jeanne M. Author-X-Name-Last: Huddleston Author-Name: Santiago Romero-Brufau Author-X-Name-First: Santiago Author-X-Name-Last: Romero-Brufau Title: Prediction and Inference With Missing Data in Patient Alert Systems Abstract: We describe the Bedside Patient Rescue (BPR) project, the goal of which is risk prediction of adverse events for non-intensive care unit patients using ∼100 variables (vitals, lab results, assessments, etc.). There are several missing predictor values for most patients, which in the health sciences is the norm, rather than the exception. A Bayesian approach is presented that addresses many of the shortcomings to standard approaches to missing predictors: (i) treatment of the uncertainty due to imputation is straightforward in the Bayesian paradigm, (ii) the predictor distribution is flexibly modeled as an infinite normal mixture with latent variables to explicitly account for discrete predictors (i.e., as in multivariate probit regression models), and (iii) certain missing not at random situations can be handled effectively by allowing the indicator of missingness into the predictor distribution only to inform the distribution of the missing variables. The proposed approach also has the benefit of providing a distribution for the prediction, including the uncertainty inherent in the imputation. Therefore, we can ask questions such as: is it possible this individual is at high risk but we are missing too much information to know for sure? How much would we reduce the uncertainty in our risk prediction by obtaining a particular missing value? This approach is applied to the BPR problem resulting in excellent predictive capability to identify deteriorating patients. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 32-46 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2019.1604359 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604359 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
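For readers unfamiliar with the machinery behind the Williams et al. multistate hidden Markov model above, the likelihood of an observation sequence given transition and emission probabilities is computed by the scaled forward recursion. The discrete, time-homogeneous version below is the standard algorithm, not the paper's time-inhomogeneous continuous-time generator formulation.

```python
import numpy as np

def hmm_forward_loglik(pi0, P, emit):
    """Scaled forward algorithm for a discrete HMM.

    pi0: (S,) initial state distribution; P: (S, S) transition matrix;
    emit: (T, S) with emit[t, s] = p(obs_t | state s).
    Returns log p(obs_1..T); scaling at each step avoids underflow.
    """
    alpha = pi0 * emit[0]
    c = alpha.sum()
    loglik = np.log(c)
    alpha = alpha / c
    for t in range(1, emit.shape[0]):
        alpha = (alpha @ P) * emit[t]   # propagate, then weight by emission
        c = alpha.sum()
        loglik += np.log(c)
        alpha = alpha / c
    return loglik
```

In a Bayesian fit, this recursion is evaluated inside the sampler at each proposed parameter value, with P replaced by interval-specific transition matrices when the generator is time-inhomogeneous.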
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:32-46 Template-Type: ReDIF-Article 1.0 Author-Name: Adam N. Smith Author-X-Name-First: Adam N. Author-X-Name-Last: Smith Author-Name: Greg M. Allenby Author-X-Name-First: Greg M. Author-X-Name-Last: Allenby Title: Demand Models With Random Partitions Abstract: Many economic models of consumer demand require researchers to partition sets of products or attributes prior to the analysis. These models are common in applied problems when the product space is large or spans multiple categories. While the partition is traditionally fixed a priori, we let the partition be a model parameter and propose a Bayesian method for inference. The challenge is that demand systems are commonly multivariate models that are not conditionally conjugate with respect to partition indices, precluding the use of Gibbs sampling. We solve this problem by constructing a new location-scale partition distribution that can generate random-walk Metropolis–Hastings proposals and also serve as a prior. Our method is illustrated in the context of a store-level category demand model, where we find that allowing for partition uncertainty is important for preserving model flexibility, improving demand forecasts, and learning about the structure of demand. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 47-65 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2019.1604360 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604360 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:47-65 Template-Type: ReDIF-Article 1.0 Author-Name: Matthew J. Heaton Author-X-Name-First: Matthew J. Author-X-Name-Last: Heaton Author-Name: Candace Berrett Author-X-Name-First: Candace Author-X-Name-Last: Berrett Author-Name: Sierra Pugh Author-X-Name-First: Sierra Author-X-Name-Last: Pugh Author-Name: Amber Evans Author-X-Name-First: Amber Author-X-Name-Last: Evans Author-Name: Chantel Sloan Author-X-Name-First: Chantel Author-X-Name-Last: Sloan Title: Modeling Bronchiolitis Incidence Proportions in the Presence of Spatio-Temporal Uncertainty Abstract: Bronchiolitis (inflammation of the lower respiratory tract) in infants is primarily due to viral infection and is the single most common cause of infant hospitalization in the United States. To increase epidemiological understanding of bronchiolitis (and, subsequently, develop better prevention strategies), this research analyzes data on infant bronchiolitis cases from the U.S. Military Health System between the years 2003–2013 in Norfolk, Virginia, USA. For privacy reasons, child home addresses, birth dates, and diagnosis dates were randomized (jittered), creating spatio-temporal uncertainty in the geographic location and timing of bronchiolitis incidents. Using spatio-temporal point patterns, we created a modeling strategy that accounts for the jittering to estimate and quantify the uncertainty for the incidence proportion (IP) of bronchiolitis. Additionally, we regress the IP onto key covariates, including pollution, where we adequately account for uncertainty in the pollution levels (i.e., covariate uncertainty) using a land use regression model. Our analysis results indicate that the IP is positively associated with sulfur dioxide and population density.
Further, we demonstrate how scientific conclusions may change if various sources of uncertainty (either spatio-temporal or covariate uncertainty) are not accounted for. Code submitted with this article was checked by an Associate Editor for Reproducibility and is available as an online supplement. Journal: Journal of the American Statistical Association Pages: 66-78 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2019.1609480 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1609480 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:66-78 Template-Type: ReDIF-Article 1.0 Author-Name: Douglas R. Wilson Author-X-Name-First: Douglas R. Author-X-Name-Last: Wilson Author-Name: Joseph G. Ibrahim Author-X-Name-First: Joseph G. Author-X-Name-Last: Ibrahim Author-Name: Wei Sun Author-X-Name-First: Wei Author-X-Name-Last: Sun Title: Mapping Tumor-Specific Expression QTLs in Impure Tumor Samples Abstract: The study of gene expression quantitative trait loci (eQTL) is an effective approach to illuminate the functional roles of genetic variants. Computational methods have been developed for eQTL mapping using gene expression data from microarray or RNA-seq technology. Application of these methods for eQTL mapping in tumor tissues is problematic because tumor tissues are composed of both tumor and infiltrating normal cells (e.g., immune cells) and eQTL effects may vary between tumor and infiltrating normal cells. To address this challenge, we have developed a new method for eQTL mapping using RNA-seq data from tumor samples. Our method separately estimates the eQTL effects in tumor and infiltrating normal cells using both total expression and allele-specific expression (ASE). We demonstrate that our method controls the Type I error rate and has higher power than some alternative approaches. We applied our method to study RNA-seq data from The Cancer Genome Atlas and illustrated the similarities and differences of eQTL effects in tumor and normal cells. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 79-89 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2019.1609968 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1609968 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:79-89 Template-Type: ReDIF-Article 1.0 Author-Name: Hojin Yang Author-X-Name-First: Hojin Author-X-Name-Last: Yang Author-Name: Veerabhadran Baladandayuthapani Author-X-Name-First: Veerabhadran Author-X-Name-Last: Baladandayuthapani Author-Name: Arvind U.K. Rao Author-X-Name-First: Arvind U.K. Author-X-Name-Last: Rao Author-Name: Jeffrey S. Morris Author-X-Name-First: Jeffrey S. Author-X-Name-Last: Morris Title: Quantile Function on Scalar Regression Analysis for Distributional Data Abstract: Radiomics involves the study of tumor images to identify quantitative markers explaining cancer heterogeneity. The predominant approach is to extract hundreds to thousands of image features, including histogram features comprising summaries of the marginal distribution of pixel intensities, which leads to multiple testing problems and can miss out on insights not contained in the selected features.
In this paper, we present methods to model the entire marginal distribution of pixel intensities via the quantile function as functional data, regressed on a set of demographic, clinical, and genetic predictors to investigate their effects on imaging-based cancer heterogeneity. We call this approach quantile functional regression, regressing subject-specific marginal distributions across repeated measurements on a set of covariates, allowing us to assess which covariates are associated with the distribution in a global sense, as well as to identify distributional features characterizing these differences, including mean, variance, skewness, heavy-tailedness, and various upper and lower quantiles. To account for smoothness in the quantile functions, account for intrafunctional correlation, and gain statistical power, we introduce custom basis functions we call quantlets that are sparse, regularized, near-lossless, and empirically defined, adapting to the features of a given dataset and containing a Gaussian subspace so non-Gaussianness can be assessed. We fit this model using a Bayesian framework that uses nonlinear shrinkage of quantlet coefficients to regularize the functional regression coefficients and provides fully Bayesian inference after fitting via Markov chain Monte Carlo. We demonstrate the benefit of the basis space modeling through simulation studies, and apply the method to a magnetic resonance imaging (MRI)-based radiomic dataset from Glioblastoma Multiforme to relate imaging-based quantile functions to various demographic, clinical, and genetic predictors, finding specific differences in tumor pixel intensity distribution between males and females and between tumors with and without DDIT3 mutations. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 90-106 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2019.1609969 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1609969 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:90-106 Template-Type: ReDIF-Article 1.0 Author-Name: Gareth M. James Author-X-Name-First: Gareth M. Author-X-Name-Last: James Author-Name: Courtney Paulson Author-X-Name-First: Courtney Author-X-Name-Last: Paulson Author-Name: Paat Rusmevichientong Author-X-Name-First: Paat Author-X-Name-Last: Rusmevichientong Title: Penalized and Constrained Optimization: An Application to High-Dimensional Website Advertising Abstract: Firms are increasingly transitioning advertising budgets to Internet display campaigns, but this transition poses new challenges. These campaigns use numerous potential metrics for success (e.g., reach or click rate), and because each website represents a separate advertising opportunity, this is also an inherently high-dimensional problem. Further, advertisers often have constraints they wish to place on their campaign, such as targeting specific sub-populations or websites. These challenges require a method flexible enough to accommodate thousands of websites, as well as numerous metrics and campaign constraints. Motivated by this application, we consider the general constrained high-dimensional problem, where the parameters satisfy linear constraints.
We develop the Penalized and Constrained optimization method (PaC) to compute the solution path for high-dimensional, linearly constrained criteria. PaC is extremely general; in addition to Internet advertising, we show it encompasses many other potential applications, such as portfolio estimation, monotone curve estimation, and the generalized lasso. Computing the PaC coefficient path poses technical challenges, but we develop an efficient algorithm over a grid of tuning parameters. Through extensive simulations, we show PaC performs well. Finally, we apply PaC to a proprietary dataset in an exemplar Internet advertising case study and demonstrate its superiority over existing methods in this practical setting. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 107-122 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2019.1609970 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1609970 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:107-122 Template-Type: ReDIF-Article 1.0 Author-Name: Victor Chernozhukov Author-X-Name-First: Victor Author-X-Name-Last: Chernozhukov Author-Name: Iván Fernández-Val Author-X-Name-First: Iván Author-X-Name-Last: Fernández-Val Author-Name: Blaise Melly Author-X-Name-First: Blaise Author-X-Name-Last: Melly Author-Name: Kaspar Wüthrich Author-X-Name-First: Kaspar Author-X-Name-Last: Wüthrich Title: Generic Inference on Quantile and Quantile Effect Functions for Discrete Outcomes Abstract: Quantile and quantile effect (QE) functions are important tools for descriptive and causal analysis due to their natural and intuitive interpretation. Existing inference methods for these functions do not apply to discrete random variables. This article offers a simple, practical construction of simultaneous confidence bands for quantile and QE functions of possibly discrete random variables. It is based on a natural transformation of simultaneous confidence bands for distribution functions, which are readily available for many problems. The construction is generic and does not depend on the nature of the underlying problem. It works in conjunction with parametric, semiparametric, and nonparametric modeling methods for observed and counterfactual distributions, and does not depend on the sampling scheme. We apply our method to characterize the distributional impact of insurance coverage on health care utilization and obtain the distributional decomposition of the racial test score gap. We find that universal insurance coverage increases the number of doctor visits across the entire distribution, and that the racial test score gap is small at early ages but grows with age due to socio-economic factors, especially at the top of the distribution. Supplementary materials (additional results, R package, replication files) for this article are available online. Journal: Journal of the American Statistical Association Pages: 123-137 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2019.1611581 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1611581 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
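The generic construction in the Chernozhukov, Fernández-Val, Melly, and Wüthrich abstract above inverts a simultaneous band for the distribution function into one for the quantile function; because the generalized inverse reverses the ordering, the upper CDF bound yields the lower quantile bound. A grid-based sketch follows, where the evaluation grid and the left-continuous inverse convention are implementation choices.

```python
import numpy as np

def quantile_band_from_cdf_band(y_grid, F_low, F_high, taus):
    """Transform a simultaneous band [F_low, F_high] for a CDF evaluated
    on the sorted grid y_grid (both bounds nondecreasing) into a
    simultaneous band for the quantile function at probabilities taus.
    """
    def ginv(F, tau):
        # smallest grid point y with F(y) >= tau (generalized inverse)
        idx = np.searchsorted(F, tau, side='left')
        return y_grid[min(idx, len(y_grid) - 1)]

    q_low = np.array([ginv(F_high, t) for t in taus])   # invert upper bound
    q_high = np.array([ginv(F_low, t) for t in taus])   # invert lower bound
    return q_low, q_high
```

Nothing here assumes continuity of the outcome, which is exactly why the construction extends to discrete random variables.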
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:123-137 Template-Type: ReDIF-Article 1.0 Author-Name: Saharon Rosset Author-X-Name-First: Saharon Author-X-Name-Last: Rosset Author-Name: Ryan J. Tibshirani Author-X-Name-First: Ryan J. Author-X-Name-Last: Tibshirani Title: From Fixed-X to Random-X Regression: Bias-Variance Decompositions, Covariance Penalties, and Prediction Error Estimation Abstract: In statistical prediction, classical approaches for model selection and model evaluation based on covariance penalties are still widely used. Most of the literature on this topic is based on what we call the “Fixed-X” assumption, where covariate values are assumed to be nonrandom. By contrast, it is often more reasonable to take a “Random-X” view, where the covariate values are independently drawn for both training and prediction. To study the applicability of covariance penalties in this setting, we propose a decomposition of Random-X prediction error in which the randomness in the covariates contributes to both the bias and variance components. This decomposition is general, but we concentrate on the fundamental case of ordinary least-squares (OLS) regression. We prove that in this setting the move from Fixed-X to Random-X prediction results in an increase in both bias and variance. When the covariates are normally distributed and the linear model is unbiased, all terms in this decomposition are explicitly computable, which yields an extension of Mallows’ Cp that we call RCp. RCp also holds asymptotically for certain classes of nonnormal covariates. When the noise variance is unknown, plugging in the usual unbiased estimate leads to an approach that we call $\widehat{{\rm RCp}}$, which is closely related to Sp and to generalized cross-validation (GCV). For excess bias, we propose an estimate based on the “shortcut formula” for ordinary cross-validation (OCV), resulting in an approach we call RCp+. Theoretical arguments and numerical simulations suggest that RCp+ is typically superior to OCV, though the difference is small. We further examine the Random-X error of other popular estimators. The surprising result we get for ridge regression is that, in the heavily regularized regime, Random-X variance is smaller than Fixed-X variance, which can lead to smaller overall Random-X error. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 138-151 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1424632 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1424632 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:138-151 Template-Type: ReDIF-Article 1.0 Author-Name: Xiaotong Shen Author-X-Name-First: Xiaotong Author-X-Name-Last: Shen Author-Name: Hsin-Cheng Huang Author-X-Name-First: Hsin-Cheng Author-X-Name-Last: Huang Title: Discussion of “From Fixed-X to Random-X Regression: Bias-Variance Decompositions, Covariance Penalties, and Prediction Error Estimation” Journal: Journal of the American Statistical Association Pages: 152-156 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1543597 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1543597 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
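For reference, the OLS leave-one-out "shortcut formula" mentioned in the Rosset and Tibshirani abstract above lets OCV be computed from a single fit via the hat-matrix diagonal, with no refitting. The sketch below is standard OLS algebra under the assumption that X has full column rank; it is not the paper's RCp+ code.

```python
# Ordinary (leave-one-out) cross-validation for OLS via the hat matrix:
#   OCV = (1/n) * sum_i ( e_i / (1 - h_ii) )^2,
# where e_i are the OLS residuals and h_ii the leverages.
import numpy as np

def ols_ocv(X, y):
    H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix (n x n)
    resid = y - H @ y                            # OLS residuals from one fit
    return np.mean((resid / (1.0 - np.diag(H))) ** 2)
```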
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:152-156 Template-Type: ReDIF-Article 1.0 Author-Name: Stefan Wager Author-X-Name-First: Stefan Author-X-Name-Last: Wager Title: Cross-Validation, Risk Estimation, and Model Selection: Comment on a Paper by Rosset and Tibshirani Journal: Journal of the American Statistical Association Pages: 157-160 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2020.1727235 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1727235 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:157-160 Template-Type: ReDIF-Article 1.0 Author-Name: Saharon Rosset Author-X-Name-First: Saharon Author-X-Name-Last: Rosset Author-Name: Ryan J. Tibshirani Author-X-Name-First: Ryan J. Author-X-Name-Last: Tibshirani Title: From Fixed-X to Random-X Regression: Bias-Variance Decompositions, Covariance Penalties, and Prediction Error Estimation: Rejoinder Journal: Journal of the American Statistical Association Pages: 161-162 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2020.1727236 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1727236 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:161-162 Template-Type: ReDIF-Article 1.0 Author-Name: Maya B. Mathur Author-X-Name-First: Maya B. Author-X-Name-Last: Mathur Author-Name: Tyler J. VanderWeele Author-X-Name-First: Tyler J. Author-X-Name-Last: VanderWeele Title: Sensitivity Analysis for Unmeasured Confounding in Meta-Analyses Abstract: Random-effects meta-analyses of observational studies can produce biased estimates if the synthesized studies are subject to unmeasured confounding. We propose sensitivity analyses quantifying the extent to which unmeasured confounding of a specified magnitude could reduce the proportion of scientifically meaningful true effect sizes to below a certain threshold. We also develop converse methods to estimate the strength of confounding capable of reducing the proportion of scientifically meaningful true effects to below a chosen threshold. These methods apply when a “bias factor” is assumed to be normally distributed across studies or is assessed across a range of fixed values. Our estimators are derived using recently proposed sharp bounds on confounding bias within a single study that do not make assumptions regarding the unmeasured confounders themselves or the functional form of their relationships with the exposure and outcome of interest. We provide an R package, EValue, and a free website that compute point estimates, perform inference, and produce plots for conducting such sensitivity analyses. These methods facilitate principled use of random-effects meta-analyses of observational studies to assess the strength of causal evidence for a hypothesis. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 163-172 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1529598 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1529598 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
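As background to the sensitivity analyses above, a hedged sketch of the single-study E-value of VanderWeele and Ding, the point formula underlying the EValue package; the meta-analytic extensions in the article involve more than this formula, and the code below is an illustration rather than the package itself.

```python
# E-value for an observed risk ratio: the minimum strength of association
# (on the risk-ratio scale) that an unmeasured confounder would need with
# both exposure and outcome to fully explain away the observed association.
import math

def e_value(rr):
    rr = max(rr, 1.0 / rr)                  # handle protective effects via 1/RR
    return rr + math.sqrt(rr * (rr - 1.0))

# Example: an observed RR of 2.0 gives an E-value of about 3.41.
```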
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:163-172 Template-Type: ReDIF-Article 1.0 Author-Name: Xi Chen Author-X-Name-First: Xi Author-X-Name-Last: Chen Author-Name: Qihang Lin Author-X-Name-First: Qihang Author-X-Name-Last: Lin Author-Name: Bodhisattva Sen Author-X-Name-First: Bodhisattva Author-X-Name-Last: Sen Title: On Degrees of Freedom of Projection Estimators With Applications to Multivariate Nonparametric Regression Abstract: In this article, we consider the nonparametric regression problem with multivariate predictors. We provide a characterization of the degrees of freedom and divergence for estimators of the unknown regression function, which are obtained as outputs of linearly constrained quadratic optimization procedures; namely, minimizers of the least-squares criterion with linear constraints and/or quadratic penalties. As special cases of our results, we derive explicit expressions for the degrees of freedom in many nonparametric regression problems, for example, bounded isotonic regression, multivariate (penalized) convex regression, and additive total variation regularization. Our theory also yields, as special cases, known results on the degrees of freedom of many well-studied estimators in the statistics literature, such as ridge regression, Lasso and generalized Lasso. Our results can be readily used to choose the tuning parameter(s) involved in the estimation procedure by minimizing Stein’s unbiased risk estimate. As a by-product of our analysis, we derive an interesting connection between bounded isotonic regression and isotonic regression on a general partially ordered set, which is of independent interest. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 173-186 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1537917 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537917 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:173-186 Template-Type: ReDIF-Article 1.0 Author-Name: Fangzheng Xie Author-X-Name-First: Fangzheng Author-X-Name-Last: Xie Author-Name: Yanxun Xu Author-X-Name-First: Yanxun Author-X-Name-Last: Xu Title: Bayesian Repulsive Gaussian Mixture Model Abstract: We develop a general class of Bayesian repulsive Gaussian mixture models that encourage well-separated clusters, aiming at reducing potentially redundant components produced by independent priors for locations (such as the Dirichlet process). The asymptotic results for the posterior distribution of the proposed models are derived, including posterior consistency and posterior contraction rate in the context of nonparametric density estimation. More importantly, we show that compared to the independent prior on the component centers, the repulsive prior introduces an additional shrinkage effect on the tail probability of the posterior number of components, which serves as a measure of model complexity. In addition, a generalized urn model that allows a random number of components and correlated component centers is developed based on the exchangeable partition distribution, which gives rise to the corresponding blocked-collapsed Gibbs sampler for posterior inference. We evaluate the performance and demonstrate the advantages of the proposed methodology through extensive simulation studies and real data analysis. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 187-203 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1537918 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537918 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:187-203 Template-Type: ReDIF-Article 1.0 Author-Name: Hui Zhao Author-X-Name-First: Hui Author-X-Name-Last: Zhao Author-Name: Qiwei Wu Author-X-Name-First: Qiwei Author-X-Name-Last: Wu Author-Name: Gang Li Author-X-Name-First: Gang Author-X-Name-Last: Li Author-Name: Jianguo Sun Author-X-Name-First: Jianguo Author-X-Name-Last: Sun Title: Simultaneous Estimation and Variable Selection for Interval-Censored Data With Broken Adaptive Ridge Regression Abstract: Simultaneous estimation and variable selection for the Cox model has been discussed by several authors when one observes right-censored failure time data. However, there does not seem to exist an established procedure for interval-censored data, a more general and complex type of failure time data, except for two parametric procedures. To address this, we propose a broken adaptive ridge (BAR) regression procedure that combines the strengths of the quadratic regularization and the adaptive weighted bridge shrinkage. In particular, the method allows for the number of covariates to be diverging with the sample size. Under some weak regularity conditions, unlike most of the existing variable selection methods, we establish both the oracle property and the grouping effect of the proposed BAR procedure. An extensive simulation study is conducted and indicates that the proposed approach works well in practical situations and deals with the collinearity problem better than the other oracle-like methods. An application is also provided. Journal: Journal of the American Statistical Association Pages: 204-216 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1537922 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537922 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:204-216 Template-Type: ReDIF-Article 1.0 Author-Name: Yunzhang Zhu Author-X-Name-First: Yunzhang Author-X-Name-Last: Zhu Author-Name: Xiaotong Shen Author-X-Name-First: Xiaotong Author-X-Name-Last: Shen Author-Name: Wei Pan Author-X-Name-First: Wei Author-X-Name-Last: Pan Title: On High-Dimensional Constrained Maximum Likelihood Inference Abstract: Inference in a high-dimensional situation may involve regularization of a certain form to treat overparameterization, posing challenges for inference. The common practice of inference uses either a regularized model, as in inference after model selection, or bias reduction, known as “debiasing.” While the first ignores statistical uncertainty inherent in regularization, the second reduces the bias introduced by regularization at the expense of increased variance. In this article, we propose a constrained maximum likelihood method for hypothesis testing involving unspecified nuisance parameters, with a focus on alleviating the impact of regularization on inference. Particularly, for general composite hypotheses, we leave the hypothesized parameters unregularized while regularizing the nuisance parameters through an L0-constraint controlling the degree of sparseness. This approach is analogous to semiparametric likelihood inference in a high-dimensional situation.
On this ground, for the Gaussian graphical model and linear regression, we derive conditions under which the asymptotic distribution of the constrained likelihood ratio is established, permitting the parameter dimension to increase with the sample size. Interestingly, the corresponding limiting distribution is the chi-square or normal, depending on whether the co-dimension of a test is finite or increases with the sample size, leading to asymptotically similar tests. This goes beyond the classical Wilks phenomenon. Numerically, we demonstrate that the proposed method performs well against its competitors in various scenarios. Finally, we apply the proposed method to infer linkages in brain network analysis based on MRI data, to contrast Alzheimer’s disease patients against healthy subjects. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 217-230 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1540986 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1540986 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:217-230 Template-Type: ReDIF-Article 1.0 Author-Name: Qian Feng Author-X-Name-First: Qian Author-X-Name-Last: Feng Author-Name: Quang Vuong Author-X-Name-First: Quang Author-X-Name-Last: Vuong Author-Name: Haiqing Xu Author-X-Name-First: Haiqing Author-X-Name-Last: Xu Title: Estimation of Heterogeneous Individual Treatment Effects With Endogenous Treatments Abstract: This article estimates individual treatment effects (ITE) and their probability distribution in a triangular model with binary-valued endogenous treatments. Our estimation procedure takes two steps. First, we estimate the counterfactual outcome, and hence the ITE, for every observational unit in the sample. Second, we estimate the ITE density function of the whole population. Our estimation method does not suffer from the ill-posed inverse problem associated with inverting a nonlinear functional. Asymptotic properties of the proposed method are established. We study its finite sample properties in Monte Carlo experiments. We also illustrate our approach with an empirical application assessing the effects of 401(k) retirement programs on personal savings. Our results show that there exists a small but statistically significant proportion of individuals who experience negative effects, although the majority of ITEs are positive. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 231-240 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1543121 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1543121 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:231-240 Template-Type: ReDIF-Article 1.0 Author-Name: Pavlo Mozharovskyi Author-X-Name-First: Pavlo Author-X-Name-Last: Mozharovskyi Author-Name: Julie Josse Author-X-Name-First: Julie Author-X-Name-Last: Josse Author-Name: François Husson Author-X-Name-First: François Author-X-Name-Last: Husson Title: Nonparametric Imputation by Data Depth Abstract: We present a single imputation method for missing values that borrows the idea of data depth—a measure of centrality defined for an arbitrary point of a space with respect to a probability distribution or data cloud.
It consists of iteratively maximizing the depth of each observation with missing values and can be employed with any properly defined statistical depth function. In each iteration, imputation reduces to the optimization of quadratic, linear, or quasiconcave functions, solved analytically, by linear programming, or by the Nelder–Mead method. As it accounts for the underlying data topology, the procedure is distribution free, allows imputation close to the data geometry, can make predictions in situations where local imputation (k-nearest neighbors, random forest) cannot, and has attractive robustness and asymptotic properties under elliptical symmetry. It is shown that a special case—when using the Mahalanobis depth—has a direct connection to well-known methods for the multivariate normal model, such as iterated regression and regularized PCA. The methodology is extended to multiple imputation for data stemming from an elliptically symmetric distribution. Simulation and real data studies show good results compared with existing popular alternatives. The method has been implemented as an R package. Supplementary materials for the article are available online. Journal: Journal of the American Statistical Association Pages: 241-253 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1543123 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1543123 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:241-253 Template-Type: ReDIF-Article 1.0 Author-Name: Qiang Sun Author-X-Name-First: Qiang Author-X-Name-Last: Sun Author-Name: Wen-Xin Zhou Author-X-Name-First: Wen-Xin Author-X-Name-Last: Zhou Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Title: Adaptive Huber Regression Abstract: Big data can easily be contaminated by outliers or contain variables with heavy-tailed distributions, which makes many conventional methods inadequate. To address this challenge, we propose the adaptive Huber regression for robust estimation and inference. The key observation is that the robustification parameter should adapt to the sample size, dimension, and moments for an optimal tradeoff between bias and robustness. Our theoretical framework deals with heavy-tailed distributions with bounded (1 + δ)th moment for any δ > 0. We establish a sharp phase transition for robust estimation of regression parameters in both low and high dimensions: when δ ≥ 1, the estimator admits a sub-Gaussian-type deviation bound without sub-Gaussian assumptions on the data, while only a slower rate is available in the regime 0 < δ < 1, and the transition is smooth and optimal. In addition, we extend the methodology to allow both heavy-tailed predictors and observation noise. Simulation studies lend further support to the theory. In a genetic study of cancer cell lines that exhibit heavy-tailedness, the proposed methods are shown to be more robust and predictive. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 254-265 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1543124 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1543124 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
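A minimal sketch of Huber regression via iteratively reweighted least squares, with a robustification parameter that grows with the sample size in the spirit of the adaptivity argued for above. The constant c and the exact rate used here are illustrative assumptions, not the paper's calibration.

```python
# Huber M-estimation by IRLS; the weight min(1, tau/|r|) corresponds to the
# Huber psi-function, and tau is allowed to adapt to (n, d) rather than being
# fixed, so bias and robustness can be traded off as the abstract describes.
import numpy as np

def huber_irls(X, y, c=1.345, n_iter=50):
    n, d = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # OLS warm start
    for _ in range(n_iter):
        r = y - X @ beta
        sigma = np.median(np.abs(r)) / 0.6745 + 1e-12    # robust scale (MAD)
        tau = c * sigma * np.sqrt(n / (d + np.log(n)))   # tau grows with n (assumed rate)
        w = np.minimum(1.0, tau / np.maximum(np.abs(r), 1e-12))
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)[0]
    return beta
```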
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:254-265 Template-Type: ReDIF-Article 1.0 Author-Name: Tomohiro Ando Author-X-Name-First: Tomohiro Author-X-Name-Last: Ando Author-Name: Jushan Bai Author-X-Name-First: Jushan Author-X-Name-Last: Bai Title: Quantile Co-Movement in Financial Markets: A Panel Quantile Model With Unobserved Heterogeneity Abstract: This article introduces a new procedure for analyzing the quantile co-movement of a large number of financial time series based on a large-scale panel data model with factor structures. The proposed method attempts to capture the unobservable heterogeneity of each of the financial time series based on sensitivity to explanatory variables and to the unobservable factor structure. In our model, the dimension of the common factor structure varies across quantiles, and the explanatory variables are allowed to depend on the factor structure. The proposed method allows for both cross-sectional and serial dependence, and heteroscedasticity, which are common in financial markets. We propose new estimation procedures for both frequentist and Bayesian frameworks. Consistency and asymptotic normality of the proposed estimator are established. We also propose a new model selection criterion for determining the number of common factors together with theoretical support. We apply the method to analyze the returns for over 6000 international stocks from over 60 countries during the subprime crisis, European sovereign debt crisis, and subsequent period. The empirical analysis indicates that the common factor structure varies across quantiles. We find that the common factors for the quantiles and the common factors for the mean are different. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 266-279 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1543598 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1543598 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:266-279 Template-Type: ReDIF-Article 1.0 Author-Name: Cencheng Shen Author-X-Name-First: Cencheng Author-X-Name-Last: Shen Author-Name: Carey E. Priebe Author-X-Name-First: Carey E. Author-X-Name-Last: Priebe Author-Name: Joshua T. Vogelstein Author-X-Name-First: Joshua T. Author-X-Name-Last: Vogelstein Title: From Distance Correlation to Multiscale Graph Correlation Abstract: Understanding and developing a correlation measure that can detect general dependencies is not only imperative to statistics and machine learning, but also crucial to general scientific discovery in the big data age. In this paper, we establish a new framework that generalizes distance correlation (Dcorr)—a correlation measure that was recently proposed and shown to be universally consistent for dependence testing against all joint distributions of finite moments—to the multiscale graph correlation (MGC). By using characteristic functions and incorporating the nearest-neighbor machinery, we formalize the population version of local distance correlations, define the optimal scale in a given dependency, and name the optimal local correlation as MGC. The new theoretical framework motivates a theoretically sound sample MGC and allows a number of desirable properties to be proved, including the universal consistency, convergence, and almost unbiasedness of the sample version.
The advantages of MGC are illustrated via a comprehensive set of simulations with linear, nonlinear, univariate, multivariate, and noisy dependencies, where it loses almost no power in monotone dependencies while achieving better performance in general dependencies, compared to Dcorr and other popular methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 280-291 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1543125 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1543125 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:280-291 Template-Type: ReDIF-Article 1.0 Author-Name: Hai Shu Author-X-Name-First: Hai Author-X-Name-Last: Shu Author-Name: Xiao Wang Author-X-Name-First: Xiao Author-X-Name-Last: Wang Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Title: D-CCA: A Decomposition-Based Canonical Correlation Analysis for High-Dimensional Datasets Abstract: A typical approach to the joint analysis of two high-dimensional datasets is to decompose each data matrix into three parts: a low-rank common matrix that captures the shared information across datasets, a low-rank distinctive matrix that characterizes the individual information within a single dataset, and an additive noise matrix. Existing decomposition methods often focus on the orthogonality between the common and distinctive matrices, but inadequately consider the more necessary orthogonal relationship between the two distinctive matrices. The latter guarantees that no more shared information is extractable from the distinctive matrices. We propose decomposition-based canonical correlation analysis (D-CCA), a novel decomposition method that defines the common and distinctive matrices from the ℓ2 space of random variables rather than the conventionally used Euclidean space, with a careful construction of the orthogonal relationship between distinctive matrices. D-CCA represents a natural generalization of the traditional canonical correlation analysis. The proposed estimators of common and distinctive matrices are shown to be consistent and to perform better than some state-of-the-art methods in both simulated data and a real data analysis of breast cancer data obtained from The Cancer Genome Atlas. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 292-306 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1543599 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1543599 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
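For intuition, a numpy sketch of the sample distance correlation (Dcorr) that the multiscale graph correlation above generalizes, using the standard double-centering definition; MGC itself additionally searches over local, nearest-neighbor scales, which this sketch does not attempt.

```python
# Sample distance correlation: double-center the pairwise distance matrices,
# then normalize the resulting distance covariance by the distance variances.
import numpy as np

def dcorr(x, y):
    # x: (n, p) array, y: (n, q) array, with a common sample size n;
    # assumes nonconstant samples so the denominator is positive.
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    return dcov2 / np.sqrt((A * A).mean() * (B * B).mean())
```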
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:292-306 Template-Type: ReDIF-Article 1.0 Author-Name: Wenliang Pan Author-X-Name-First: Wenliang Author-X-Name-Last: Pan Author-Name: Xueqin Wang Author-X-Name-First: Xueqin Author-X-Name-Last: Wang Author-Name: Heping Zhang Author-X-Name-First: Heping Author-X-Name-Last: Zhang Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Author-Name: Jin Zhu Author-X-Name-First: Jin Author-X-Name-Last: Zhu Title: Ball Covariance: A Generic Measure of Dependence in Banach Space Abstract: Technological advances in science and engineering have led to the routine collection of large and complex data objects, where the dependence structure among those objects is often of great interest. Those complex objects (e.g., different brain subcortical structures) often reside in some Banach spaces, and hence their relationship cannot be well characterized by most of the existing measures of dependence such as correlation coefficients developed in Hilbert spaces. To overcome the limitations of the existing measures, we propose Ball Covariance as a generic measure of dependence between two random objects in two possibly different Banach spaces. Our Ball Covariance possesses the following attractive properties: (i) It is nonparametric and model-free, which makes the proposed measure robust to model misspecification; (ii) It is nonnegative and equal to zero if and only if two random objects in two separable Banach spaces are independent; (iii) Empirical Ball Covariance is easy to compute and can be used as a test statistic of independence. We present both theoretical and numerical results to reveal the potential power of the Ball Covariance in detecting dependence. Importantly, we also analyze two real datasets to demonstrate the usefulness of Ball Covariance in detecting complex dependence. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 307-317 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1543600 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1543600 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:307-317 Template-Type: ReDIF-Article 1.0 Author-Name: Raffaele Argiento Author-X-Name-First: Raffaele Author-X-Name-Last: Argiento Author-Name: Andrea Cremaschi Author-X-Name-First: Andrea Author-X-Name-Last: Cremaschi Author-Name: Marina Vannucci Author-X-Name-First: Marina Author-X-Name-Last: Vannucci Title: Hierarchical Normalized Completely Random Measures to Cluster Grouped Data Abstract: In this article, we propose a Bayesian nonparametric model for clustering grouped data. We adopt a hierarchical approach: at the highest level, each group of data is modeled according to a mixture, where the mixing distributions are conditionally independent normalized completely random measures (NormCRMs) centered on the same base measure, which is itself a NormCRM. The discreteness of the shared base measure implies that the processes at the data level share the same atoms. This desired feature allows observations from different groups to be clustered together. We obtain a representation of the hierarchical clustering model by marginalizing with respect to the infinite dimensional NormCRMs.
We investigate the properties of the clustering structure induced by the proposed model and provide theoretical results concerning the distribution of the number of clusters, within and between groups. Furthermore, we offer an interpretation in terms of a generalized Chinese restaurant franchise process, which allows for posterior inference under both conjugate and nonconjugate models. We develop algorithms for fully Bayesian inference and assess performance by means of a simulation study and a real-data illustration. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 318-333 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2019.1594833 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1594833 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:318-333 Template-Type: ReDIF-Article 1.0 Author-Name: Hyebin Song Author-X-Name-First: Hyebin Author-X-Name-Last: Song Author-Name: Garvesh Raskutti Author-X-Name-First: Garvesh Author-X-Name-Last: Raskutti Title: PUlasso: High-Dimensional Variable Selection With Presence-Only Data Abstract: In various real-world problems, we are presented with classification problems with positive and unlabeled data, referred to as presence-only responses. In this article, we study variable selection in the context of presence-only responses, where the number of features or covariates p is large. The combination of presence-only responses and high dimensionality presents both statistical and computational challenges. In this article, we develop the PUlasso algorithm for variable selection and classification with positive and unlabeled responses. Our algorithm involves using the majorization-minimization framework, which is a generalization of the well-known expectation-maximization (EM) algorithm. In particular, to make our algorithm scalable, we provide two computational speed-ups to the standard EM algorithm. We provide a theoretical guarantee where we first show that our algorithm converges to a stationary point, and then prove that any stationary point within a local neighborhood of the true parameter achieves the minimax optimal mean-squared error under both strict sparsity and group sparsity assumptions. We also demonstrate through simulations that our algorithm outperforms state-of-the-art algorithms in moderate-p settings in terms of classification performance. Finally, we demonstrate that our PUlasso algorithm performs well on a biochemistry example. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 334-347 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1546587 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1546587 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
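To illustrate the majorization-minimization template that the PUlasso abstract above builds on, here is a generic MM step for plain logistic regression using the classical Böhning curvature bound; this is only the MM idea in its simplest form, not PUlasso's own majorizer or its computational speed-ups.

```python
# MM for logistic regression: the Hessian of the negative log-likelihood is
# bounded above by X'X / 4 (Bohning's bound), so minimizing the resulting
# quadratic surrogate at each step gives a monotone descent algorithm.
import numpy as np

def mm_logistic(X, y, n_iter=200):
    # y in {0, 1}; assumes X'X is invertible.
    beta = np.zeros(X.shape[1])
    Q = 4.0 * np.linalg.inv(X.T @ X)          # inverse of the fixed majorizer curvature
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # current fitted probabilities
        beta = beta - Q @ (X.T @ (p - y))     # exact minimizer of the surrogate
    return beta
```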
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:334-347 Template-Type: ReDIF-Article 1.0 Author-Name: Radoslav Harman Author-X-Name-First: Radoslav Author-X-Name-Last: Harman Author-Name: Lenka Filová Author-X-Name-First: Lenka Author-X-Name-Last: Filová Author-Name: Peter Richtárik Author-X-Name-First: Peter Author-X-Name-Last: Richtárik Title: A Randomized Exchange Algorithm for Computing Optimal Approximate Designs of Experiments Abstract: We propose a class of subspace ascent methods for computing optimal approximate designs that covers existing algorithms as well as new and more efficient ones. Within this class of methods, we construct a simple, randomized exchange algorithm (REX). Numerical comparisons suggest that the performance of REX is comparable or superior to that of state-of-the-art methods across a broad range of problem structures and sizes. We focus on the most commonly used criterion of D-optimality, which also has applications beyond experimental design, such as the construction of the minimum-volume ellipsoid containing a given set of data points. For D-optimality, we prove that the proposed algorithm converges to the optimum. We also provide formulas for the optimal exchange of weights in the case of the criterion of A-optimality, which enable one to use REX and some other algorithms for computing A-optimal and I-optimal designs. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 348-361 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1546588 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1546588 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:348-361 Template-Type: ReDIF-Article 1.0 Author-Name: Yingying Fan Author-X-Name-First: Yingying Author-X-Name-Last: Fan Author-Name: Emre Demirkaya Author-X-Name-First: Emre Author-X-Name-Last: Demirkaya Author-Name: Gaorong Li Author-X-Name-First: Gaorong Author-X-Name-Last: Li Author-Name: Jinchi Lv Author-X-Name-First: Jinchi Author-X-Name-Last: Lv Title: RANK: Large-Scale Inference With Graphical Nonlinear Knockoffs Abstract: Power and reproducibility are key to enabling refined scientific discoveries in contemporary big data applications with general high-dimensional nonlinear models. In this article, we provide theoretical foundations on the power and robustness for the model-X knockoffs procedure introduced recently by Candès, Fan, Janson, and Lv in the high-dimensional setting when the covariate distribution is characterized by a Gaussian graphical model. We establish that under mild regularity conditions, the power of the oracle knockoffs procedure with known covariate distribution in high-dimensional linear models is asymptotically one as the sample size goes to infinity. When moving away from the ideal case, we suggest the modified model-X knockoffs method called graphical nonlinear knockoffs (RANK) to accommodate the unknown covariate distribution. We provide theoretical justifications on the robustness of our modified procedure by showing that the false discovery rate (FDR) is asymptotically controlled at the target level and the power is asymptotically one with the estimated covariate distribution. To the best of our knowledge, this is the first formal theoretical result on the power of the knockoffs procedure.
Simulation results demonstrate that, compared to existing approaches, our method performs competitively in both FDR control and power. A real dataset is analyzed to further assess the performance of the suggested knockoffs procedure. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 362-379 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1546589 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1546589 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:362-379 Template-Type: ReDIF-Article 1.0 Author-Name: Peng Wu Author-X-Name-First: Peng Author-X-Name-Last: Wu Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: Yuanjia Wang Author-X-Name-First: Yuanjia Author-X-Name-Last: Wang Title: Matched Learning for Optimizing Individualized Treatment Strategies Using Electronic Health Records Abstract: Current guidelines for treatment decision making largely rely on data from randomized controlled trials (RCTs) studying average treatment effects. They may be inadequate to make individualized treatment decisions in real-world settings. Large-scale electronic health records (EHR) provide opportunities to fulfill the goals of personalized medicine and learn individualized treatment rules (ITRs) depending on patient-specific characteristics from real-world patient data. In this work, we tackle challenges with EHRs and propose a machine learning approach based on matching (M-learning) to estimate optimal ITRs from EHRs. This new learning method performs matching, instead of the inverse probability weighting commonly used in many existing methods for estimating ITRs, to more accurately assess individuals’ responses to alternative treatments and to alleviate confounding. Matching-based value functions are proposed to compare matched pairs under a unified framework, where various types of outcomes for measuring treatment response (including continuous, ordinal, and discrete outcomes) can easily be accommodated. We establish the Fisher consistency and convergence rate of M-learning. Through extensive simulation studies, we show that M-learning outperforms existing methods when propensity scores are misspecified or when unmeasured confounders are present in certain scenarios. Lastly, we apply M-learning to estimate optimal personalized second-line treatments for type 2 diabetes patients to achieve better glycemic control or reduce major complications using EHRs from New York Presbyterian Hospital. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 380-392 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1549050 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1549050 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
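As background to the RANK abstract above, a hedged sketch of the second-order Gaussian model-X knockoffs construction of Candès, Fan, Janson, and Lv that the oracle procedure relies on; the equicorrelated choice of s below is one simple option rather than a recommendation, and Sigma is assumed standardized to unit diagonal.

```python
# Gaussian model-X knockoffs: (X, X_tilde) is jointly Gaussian with
# Cov = [[Sigma, Sigma - D], [Sigma - D, Sigma]], D = diag(s), which requires
# 2*Sigma - D to be positive semidefinite. Sample X_tilde | X from the
# implied conditional Gaussian.
import numpy as np

def gaussian_knockoffs(X, mu, Sigma, rng):
    n, p = X.shape
    lam_min = np.linalg.eigvalsh(Sigma).min()
    s = np.full(p, min(2.0 * lam_min, 1.0))    # equicorrelated s (assumes unit-variance Sigma)
    D = np.diag(s)
    Sinv_D = np.linalg.solve(Sigma, D)         # Sigma^{-1} D
    cond_mean = mu + (X - mu) @ (np.eye(p) - Sinv_D)
    cond_cov = 2.0 * D - D @ Sinv_D
    C = np.linalg.cholesky(cond_cov + 1e-10 * np.eye(p))
    return cond_mean + rng.standard_normal((n, p)) @ C.T
```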
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:380-392 Template-Type: ReDIF-Article 1.0 Author-Name: Yaowu Liu Author-X-Name-First: Yaowu Author-X-Name-Last: Liu Author-Name: Jun Xie Author-X-Name-First: Jun Author-X-Name-Last: Xie Title: Cauchy Combination Test: A Powerful Test With Analytic p-Value Calculation Under Arbitrary Dependency Structures Abstract: Combining individual p-values to aggregate multiple small effects has long been of interest in statistics, dating back to Fisher’s classic combination test. In modern large-scale data analysis, correlation and sparsity are common features and efficient computation is a necessary requirement for dealing with massive data. To overcome these challenges, we propose a new test that takes advantage of the Cauchy distribution. Our test statistic has a simple form and is defined as a weighted sum of Cauchy transformations of the individual p-values. We prove a nonasymptotic result that the tail of the null distribution of our proposed test statistic can be well approximated by a Cauchy distribution under arbitrary dependency structures. Based on this theoretical result, the p-value calculation of our proposed test is not only accurate, but also as simple as the classic z-test or t-test, making our test well suited for analyzing massive data. We further show that the power of the proposed test is asymptotically optimal in a strong sparsity setting. Extensive simulations demonstrate that the proposed test has both strong power against sparse alternatives and good accuracy in p-value calculation, especially for very small p-values. The proposed test has also been applied to a genome-wide association study of Crohn’s disease and compared with several existing tests. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 393-402 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1554485 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1554485 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:393-402 Template-Type: ReDIF-Article 1.0 Author-Name: Dehan Kong Author-X-Name-First: Dehan Author-X-Name-Last: Kong Author-Name: Baiguo An Author-X-Name-First: Baiguo Author-X-Name-Last: An Author-Name: Jingwen Zhang Author-X-Name-First: Jingwen Author-X-Name-Last: Zhang Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Title: L2RM: Low-Rank Linear Regression Models for High-Dimensional Matrix Responses Abstract: The aim of this article is to develop a low-rank linear regression model to correlate a high-dimensional response matrix with a high-dimensional vector of covariates when coefficient matrices have low-rank structures. We propose a fast and efficient screening procedure based on the spectral norm of each coefficient matrix to deal with the case when the number of covariates is extremely large. We develop an efficient estimation procedure based on the trace norm regularization, which explicitly imposes the low-rank structure of coefficient matrices. When both the dimension of the response matrix and that of the covariate vector diverge at an exponential order of the sample size, we investigate the sure independence screening property under some mild conditions.
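Referring back to the Cauchy combination test record above, a minimal sketch of the statistic and its analytic p-value as described in that abstract; the uniform weights used here are an illustrative default, since the abstract only requires a weighted sum.

```python
# Cauchy combination: T = sum_i w_i * tan((0.5 - p_i) * pi). Under the null,
# the tail of T is approximately standard Cauchy even under dependence, so the
# global p-value is the analytic Cauchy tail probability.
import numpy as np

def cauchy_combination(pvals, weights=None):
    p = np.asarray(pvals, dtype=float)
    w = np.full(len(p), 1.0 / len(p)) if weights is None else np.asarray(weights)
    t = np.sum(w * np.tan((0.5 - p) * np.pi))   # the Cauchy transformation
    return 0.5 - np.arctan(t) / np.pi           # standard-Cauchy tail approximation
```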
We also systematically investigate some theoretical properties of our estimation procedure, including estimation consistency, rank consistency, and a nonasymptotic error bound, under some mild conditions. We further establish a theoretical guarantee for the overall solution of our two-step screening and estimation procedure. We examine the finite-sample performance of our screening and estimation methods using simulations and a large-scale imaging genetic dataset collected by the Philadelphia Neurodevelopmental Cohort study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 403-424 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1555092 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1555092 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:403-424 Template-Type: ReDIF-Article 1.0 Author-Name: Jean-Pierre Florens Author-X-Name-First: Jean-Pierre Author-X-Name-Last: Florens Author-Name: Léopold Simar Author-X-Name-First: Léopold Author-X-Name-Last: Simar Author-Name: Ingrid Van Keilegom Author-X-Name-First: Ingrid Author-X-Name-Last: Van Keilegom Title: Estimation of the Boundary of a Variable Observed With Symmetric Error Abstract: Consider the model Y = X + ε with X = τ + Z, where τ is an unknown constant (the boundary of X), Z is a random variable defined on R+, ε is a symmetric error, and ε and Z are independent. Based on an iid sample of Y, we aim at identifying and estimating the boundary τ when the law of ε is unknown (apart from symmetry) and, in particular, its variance is unknown. We propose an estimation procedure based on a minimal distance approach and by making use of Laguerre polynomials. Asymptotic results as well as finite-sample simulations are shown. The paper also proposes an extension to stochastic frontier analysis, where the model is conditional on observed variables. The model becomes Y = τ(w1, w2) + Z + ε, where Y is a cost, w1 denotes the observed outputs, and w2 represents the observed values of other conditioning variables, so that Z is the cost inefficiency. Some simulations illustrate again how the approach works in finite samples, and the proposed procedure is illustrated with data coming from post offices in France. Journal: Journal of the American Statistical Association Pages: 425-441 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1555093 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1555093 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:425-441 Template-Type: ReDIF-Article 1.0 Author-Name: Jingshen Wang Author-X-Name-First: Jingshen Author-X-Name-Last: Wang Author-Name: Xuming He Author-X-Name-First: Xuming Author-X-Name-Last: He Author-Name: Gongjun Xu Author-X-Name-First: Gongjun Author-X-Name-Last: Xu Title: Debiased Inference on Treatment Effect in a High-Dimensional Model Abstract: This article concerns the potential bias in statistical inference on treatment effects when a large number of covariates are present in a linear or partially linear model. While the estimation bias in an under-fitted model is well understood, we address a lesser-known bias that arises from an over-fitted model.
The over-fitting bias can be eliminated through data splitting at the cost of statistical efficiency, and we show that smoothing over random data splits can be pursued to mitigate the efficiency loss. We also discuss some of the existing methods for debiased inference and provide insights into their intrinsic bias-variance trade-off, which leads to an improvement in bias control. Under appropriate conditions, we show that the proposed estimators for the treatment effects are asymptotically normal and their variances can be well estimated. We discuss the pros and cons of various methods both theoretically and empirically, and show that the proposed methods are valuable options in post-selection inference. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 442-454 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1558062 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1558062 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:442-454 Template-Type: ReDIF-Article 1.0 Author-Name: Timothée Tabouy Author-X-Name-First: Timothée Author-X-Name-Last: Tabouy Author-Name: Pierre Barbillon Author-X-Name-First: Pierre Author-X-Name-Last: Barbillon Author-Name: Julien Chiquet Author-X-Name-First: Julien Author-X-Name-Last: Chiquet Title: Variational Inference for Stochastic Block Models From Sampled Data Abstract: This article deals with dyads that are not observed during the sampling of a network and the consequent issues in the inference of the stochastic block model (SBM). We review sampling designs and recover missing at random (MAR) and not missing at random (NMAR) conditions for the SBM. We introduce variants of the variational EM algorithm for inferring the SBM under various sampling designs (MAR and NMAR), all available as an R package. Model selection criteria based on integrated classification likelihood are derived for selecting both the number of blocks and the sampling design. We investigate the accuracy and the range of applicability of these algorithms with simulations. We explore two real-world networks from ethnology (seed circulation network) and biology (protein–protein interaction network), where the interpretations depend considerably on the sampling design considered. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 455-466 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1562934 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1562934 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:455-466 Template-Type: ReDIF-Article 1.0 Author-Name: Aurore Delaigle Author-X-Name-First: Aurore Author-X-Name-Last: Delaigle Author-Name: Wei Huang Author-X-Name-First: Wei Author-X-Name-Last: Huang Author-Name: Shaoke Lei Author-X-Name-First: Shaoke Author-X-Name-Last: Lei Title: Estimation of Conditional Prevalence From Group Testing Data With Missing Covariates Abstract: We consider estimating the conditional prevalence of a disease from data pooled according to the group testing mechanism. Consistent estimators have been proposed in the literature, but they rely on the data being available for all individuals.
In infectious disease studies, where group testing is frequently applied, the covariate is often missing for some individuals. There, unless the missingness occurs completely at random, applying the existing techniques to the complete cases without adjusting for missingness does not generally provide consistent estimators, and finding appropriate modifications is challenging. We develop a consistent spline estimator, derive its theoretical properties, and show how to adapt local polynomial and likelihood estimators to the missing data problem. We illustrate the numerical performance of our methods on simulated and real examples. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 467-480 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2019.1566071 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1566071 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:467-480 Template-Type: ReDIF-Article 1.0 Author-Name: Thibault Vatter Author-X-Name-First: Thibault Author-X-Name-Last: Vatter Title: Simulating Copulas: Stochastic Models, Sampling Algorithms, and Applications Journal: Journal of the American Statistical Association Pages: 481-482 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2020.1721244 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1721244 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:481-482 Template-Type: ReDIF-Article 1.0 Author-Name: Peter M. Aronow Author-X-Name-First: Peter M. Author-X-Name-Last: Aronow Author-Name: Fredrik Sävje Author-X-Name-First: Fredrik Author-X-Name-Last: Sävje Title: The Book of Why: The New Science of Cause and Effect Journal: Journal of the American Statistical Association Pages: 482-485 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2020.1721245 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1721245 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:482-485 Template-Type: ReDIF-Article 1.0 Author-Name: Noor Azina Ismail Author-X-Name-First: Noor Azina Author-X-Name-Last: Ismail Title: Measuring Agreement: Models, Methods, and Applications. Journal: Journal of the American Statistical Association Pages: 485-486 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2020.1721246 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1721246 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:485-486 Template-Type: ReDIF-Article 1.0 Author-Name: Qing Wang Author-X-Name-First: Qing Author-X-Name-Last: Wang Title: Multivariate Kernel Smoothing and Its Applications Journal: Journal of the American Statistical Association Pages: 486-486 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2020.1721247 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1721247 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:486-486 Template-Type: ReDIF-Article 1.0 Author-Name: Anita D. Behme Author-X-Name-First: Anita D. Author-X-Name-Last: Behme Title: Theory of Stochastic Objects: Probability, Stochastic Processes and Inference.
Journal: Journal of the American Statistical Association Pages: 486-487 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2020.1721248 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1721248 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:486-487 Template-Type: ReDIF-Article 1.0 Author-Name: Oliver Y. Chén Author-X-Name-First: Oliver Y. Author-X-Name-Last: Chén Title: Big Data in Omics and Imaging: Integrated Analysis and Causal Inference. Journal: Journal of the American Statistical Association Pages: 487-488 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2020.1721249 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1721249 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:487-488 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: RETRACTED ARTICLE: Smoothing with Couplings of Conditional Particle Filters Journal: Journal of the American Statistical Association Pages: 489-489 Issue: 529 Volume: 115 Year: 2020 Month: 1 X-DOI: 10.1080/01621459.2018.1505625 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1505625 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:489-489 Template-Type: ReDIF-Article 1.0 Author-Name: Shiwen Zhao Author-X-Name-First: Shiwen Author-X-Name-Last: Zhao Author-Name: Barbara E. Engelhardt Author-X-Name-First: Barbara E. Author-X-Name-Last: Engelhardt Author-Name: Sayan Mukherjee Author-X-Name-First: Sayan Author-X-Name-Last: Mukherjee Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: Fast Moment Estimation for Generalized Latent Dirichlet Models Abstract: We develop a generalized method of moments (GMM) approach for fast parameter estimation in a new class of Dirichlet latent variable models with mixed data types. Parameter estimation via GMM has computational and statistical advantages over alternative methods, such as expectation maximization, variational inference, and Markov chain Monte Carlo. A key computational advantage of our method, Moment Estimation for latent Dirichlet models (MELD), is that parameter estimation does not require instantiation of the latent variables. Moreover, performance is agnostic to distributional assumptions of the observations. We derive population moment conditions after marginalizing out the sample-specific Dirichlet latent variables. The moment conditions only depend on component mean parameters. We illustrate the utility of our approach on simulated data, comparing results from MELD to alternative methods, and we show the promise of our approach through applications to several datasets. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1528-1540 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1341839 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1341839 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1528-1540 Template-Type: ReDIF-Article 1.0 Author-Name: Yichi Zhang Author-X-Name-First: Yichi Author-X-Name-Last: Zhang Author-Name: Eric B. Laber Author-X-Name-First: Eric B.
Author-X-Name-Last: Laber Author-Name: Marie Davidian Author-X-Name-First: Marie Author-X-Name-Last: Davidian Author-Name: Anastasios A. Tsiatis Author-X-Name-First: Anastasios A. Author-X-Name-Last: Tsiatis Title: Interpretable Dynamic Treatment Regimes Abstract: Precision medicine is currently a topic of great interest in clinical and intervention science. A key component of precision medicine is that it is evidence-based, that is, data-driven, and consequently there has been tremendous interest in estimation of precision medicine strategies using observational or randomized study data. One way to formalize precision medicine is through a treatment regime, which is a sequence of decision rules, one per stage of clinical intervention, that map up-to-date patient information to a recommended treatment. An optimal treatment regime is defined as one maximizing the mean of some cumulative clinical outcome if applied to a population of interest. It is well known that even under simple generative models an optimal treatment regime can be a highly nonlinear function of patient information. Consequently, a focal point of recent methodological research has been the development of flexible models for estimating optimal treatment regimes. However, in many settings, estimation of an optimal treatment regime is an exploratory analysis intended to generate new hypotheses for subsequent research and not to directly dictate treatment to new patients. In such settings, an estimated treatment regime that is interpretable in a domain context may be of greater value than an unintelligible treatment regime built using “black-box” estimation methods. We propose an estimator of an optimal treatment regime composed of a sequence of decision rules, each expressible as a list of “if-then” statements that can be presented as either a paragraph or as a simple flowchart that is immediately interpretable to domain experts. The discreteness of these lists precludes smooth, that is, gradient-based, methods of estimation and leads to nonstandard asymptotics. Nevertheless, we provide a computationally efficient estimation algorithm, prove consistency of the proposed estimator, and derive rates of convergence. We illustrate the proposed methods using a series of simulation examples and an application to data from a sequential clinical trial on bipolar disorder. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1541-1549 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1345743 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1345743 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1541-1549 Template-Type: ReDIF-Article 1.0 Author-Name: Ling Zhou Author-X-Name-First: Ling Author-X-Name-Last: Zhou Author-Name: Huazhen Lin Author-X-Name-First: Huazhen Author-X-Name-Last: Lin Author-Name: Hua Liang Author-X-Name-First: Hua Author-X-Name-Last: Liang Title: Efficient Estimation of the Nonparametric Mean and Covariance Functions for Longitudinal and Sparse Functional Data Abstract: We consider the estimation of mean and covariance functions for longitudinal and sparse functional data by using the full quasi-likelihood coupled with a modification of the local kernel smoothing method. The proposed estimators are shown to be consistent, asymptotically normal, and semiparametrically efficient in terms of their linear functionals.
Their superiority over competing methods is further illustrated numerically through simulation studies. The method is applied to analyze an AIDS study and an atmospheric study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1550-1564 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1356317 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356317 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1550-1564 Template-Type: ReDIF-Article 1.0 Author-Name: Clément Dombry Author-X-Name-First: Clément Author-X-Name-Last: Dombry Author-Name: Mathieu Ribatet Author-X-Name-First: Mathieu Author-X-Name-Last: Ribatet Author-Name: Stilian Stoev Author-X-Name-First: Stilian Author-X-Name-Last: Stoev Title: Probabilities of Concurrent Extremes Abstract: The statistical modeling of spatial extremes has been an active area of recent research with a growing domain of applications. Much of the existing methodology, however, focuses on the magnitudes of extreme events rather than on their timing. To address this gap, this article investigates the notion of extremal concurrence. Suppose that daily temperatures are measured at several synoptic stations. We say that extremes are concurrent if record maximum temperatures occur simultaneously, that is, on the same day for all stations. It is important to be able to understand, quantify, and model extremal concurrence. Under general conditions, we show that the finite sample concurrence probability converges to an asymptotic quantity, deemed the extremal concurrence probability. Using Palm calculus, we establish general expressions for the extremal concurrence probability through the max-stable process emerging in the limit of the component-wise maxima of the sample. Explicit forms of the extremal concurrence probabilities are obtained for various max-stable models and several estimators are introduced. In particular, we prove that the pairwise extremal concurrence probability for max-stable vectors is precisely equal to Kendall’s τ. The estimators are evaluated from simulations and applied to study temperature extremes in the United States. Results demonstrate that concurrence probability can be used to study, for example, the effect of global climate phenomena such as the El Niño Southern Oscillation (ENSO) or global warming on the spatial structure and areal impact of extremes. Journal: Journal of the American Statistical Association Pages: 1565-1582 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1356318 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356318 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1565-1582 Template-Type: ReDIF-Article 1.0 Author-Name: Yinchu Zhu Author-X-Name-First: Yinchu Author-X-Name-Last: Zhu Author-Name: Jelena Bradic Author-X-Name-First: Jelena Author-X-Name-Last: Bradic Title: Linear Hypothesis Testing in Dense High-Dimensional Linear Models Abstract: We propose a methodology for testing linear hypotheses in high-dimensional linear models. The proposed test does not impose any restriction on the size of the model, that is, on model sparsity or the loading vector representing the hypothesis.
Providing asymptotically valid methods for testing general linear functions of the regression parameters in high dimensions is extremely challenging—especially without making restrictive or unverifiable assumptions on the number of nonzero elements. We propose to test the moment conditions related to the newly designed restructured regression, where the inputs are transformed and augmented features. These new features incorporate the structure of the null hypothesis directly. The test statistics are constructed in such a way that lack of sparsity in the original model parameter does not present a problem for the theoretical justification of our procedures. We establish asymptotically exact control on Type I error without imposing any sparsity assumptions on the model parameter or the vector representing the linear hypothesis. Our method is also shown to achieve certain optimality in detecting deviations from the null hypothesis. We demonstrate the favorable finite-sample performance of the proposed methods via a number of numerical examples and a real data example. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1583-1600 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1356319 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356319 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1583-1600 Template-Type: ReDIF-Article 1.0 Author-Name: Xiaoxiao Sun Author-X-Name-First: Xiaoxiao Author-X-Name-Last: Sun Author-Name: Pang Du Author-X-Name-First: Pang Author-X-Name-Last: Du Author-Name: Xiao Wang Author-X-Name-First: Xiao Author-X-Name-Last: Wang Author-Name: Ping Ma Author-X-Name-First: Ping Author-X-Name-Last: Ma Title: Optimal Penalized Function-on-Function Regression Under a Reproducing Kernel Hilbert Space Framework Abstract: Many scientific studies collect data where the response and predictor variables are both functions of time, location, or some other covariate. Understanding the relationship between these functional variables is a common goal in these studies. Motivated by two real-life examples, we present in this article a function-on-function regression model that can be used to analyze such functional data. Our estimator of the 2D coefficient function is the optimizer of a form of penalized least squares where the penalty enforces a certain level of smoothness on the estimator. Our first result is a representer theorem, which states that the exact optimizer of the penalized least squares actually resides in a data-adaptive finite-dimensional subspace, although the optimization problem is defined on a function space of infinite dimensions. This theorem then allows us to easily incorporate Gaussian quadrature into the optimization of the penalized least squares, which can be carried out through standard numerical procedures. We also show that our estimator achieves the minimax convergence rate in mean prediction under the framework of function-on-function regression. Extensive simulation studies demonstrate the numerical advantages of our method over the existing ones, where a sparse functional data extension is also introduced. The proposed method is then applied to our motivating examples of the benchmark Canadian weather data and a histone regulation study. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 1601-1611 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1356320 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356320 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1601-1611 Template-Type: ReDIF-Article 1.0 Author-Name: Matthew Dawson Author-X-Name-First: Matthew Author-X-Name-Last: Dawson Author-Name: Hans-Georg Müller Author-X-Name-First: Hans-Georg Author-X-Name-Last: Müller Title: Dynamic Modeling of Conditional Quantile Trajectories, With Application to Longitudinal Snippet Data Abstract: Longitudinal data are often plagued with sparsity of time points where measurements are available. The functional data analysis perspective has been shown to provide an effective and flexible approach to address this problem for the case where measurements are sparse but their times are randomly distributed over an interval. Here, we focus on a different scenario where available data can be characterized as snippets, which are very short stretches of longitudinal measurements. For each subject, the stretch of available data is much shorter than the time frame of interest, a common occurrence in accelerated longitudinal studies. An added challenge is introduced if a time proxy that is basic for usual longitudinal modeling is not available. This situation arises in the case of Alzheimer’s disease and comparable scenarios, where one is interested in time dynamics of declining performance, but the time of disease onset is unknown and chronological age does not provide a meaningful time reference for longitudinal modeling. Our main methodological contribution to address these challenges is to introduce conditional quantile trajectories for monotonic processes that emerge as solutions of a dynamic system. Our proposed estimates for these trajectories are shown to be uniformly consistent. Conditional quantile trajectories are useful descriptors of processes that quantify deterioration over time, such as hippocampal volumes in Alzheimer’s patients. We demonstrate how the proposed approach can be applied to longitudinal snippet data sampled from such processes. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1612-1624 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1356321 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356321 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1612-1624 Template-Type: ReDIF-Article 1.0 Author-Name: Minjie Fan Author-X-Name-First: Minjie Author-X-Name-Last: Fan Author-Name: Debashis Paul Author-X-Name-First: Debashis Author-X-Name-Last: Paul Author-Name: Thomas C. M. Lee Author-X-Name-First: Thomas C. M. Author-X-Name-Last: Lee Author-Name: Tomoko Matsuo Author-X-Name-First: Tomoko Author-X-Name-Last: Matsuo Title: Modeling Tangential Vector Fields on a Sphere Abstract: Physical processes that manifest as tangential vector fields on a sphere are common in geophysical and environmental sciences. These naturally occurring vector fields are often subject to physical constraints, such as being curl-free or divergence-free.
We start by constructing parametric models for curl-free and divergence-free vector fields that are tangential to the unit sphere by applying the surface gradient or the surface curl operator to a scalar random potential field on the unit sphere. Using the Helmholtz–Hodge decomposition, we then construct a class of simple but flexible parametric models for general tangential vector fields, which are represented as a sum of a curl-free component and a divergence-free component. We propose a likelihood-based parameter estimation procedure, and show that fast computation is possible even for large datasets when the observations are on a regular latitude–longitude grid. Characteristics and practical utility of the proposed methodology are illustrated through extensive simulation studies and an application to a dataset of ocean surface wind velocities collected by satellite-based scatterometers. We also compare our model with a bivariate Matérn model and a non-stationary bivariate global model. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1625-1636 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1356322 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356322 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1625-1636 Template-Type: ReDIF-Article 1.0 Author-Name: Bing Li Author-X-Name-First: Bing Author-X-Name-Last: Li Author-Name: Eftychia Solea Author-X-Name-First: Eftychia Author-X-Name-Last: Solea Title: A Nonparametric Graphical Model for Functional Data With Application to Brain Networks Based on fMRI Abstract: We introduce a nonparametric graphical model whose observations on vertices are functions. Many modern applications, such as electroencephalogram and functional magnetic resonance imaging (fMRI), produce data of this type. The model is based on additive conditional independence (ACI), a statistical relation that captures the spirit of conditional independence without resorting to multi-dimensional kernels. The random functions are assumed to reside in a Hilbert space. No distributional assumption is imposed on the random functions: instead, their statistical relations are characterized nonparametrically by a second Hilbert space, which is a reproducing kernel Hilbert space whose kernel is determined by the inner product of the first Hilbert space. A precision operator is then constructed based on the second space, which characterizes ACI, and hence also the graph. The resulting estimator is relatively easy to compute, requiring no iterative optimization or inversion of large matrices. We establish the consistency and the convergence rate of the estimator. Through simulation studies we demonstrate that the estimator performs better than the functional Gaussian graphical model when the relations among vertices are nonlinear or heteroscedastic. The method is applied to an fMRI dataset to construct brain networks for patients with attention-deficit/hyperactivity disorder. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1637-1655 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1356726 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356726 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1637-1655 Template-Type: ReDIF-Article 1.0 Author-Name: Siddhartha Chib Author-X-Name-First: Siddhartha Author-X-Name-Last: Chib Author-Name: Minchul Shin Author-X-Name-First: Minchul Author-X-Name-Last: Shin Author-Name: Anna Simoni Author-X-Name-First: Anna Author-X-Name-Last: Simoni Title: Bayesian Estimation and Comparison of Moment Condition Models Abstract: In this article, we develop a Bayesian semiparametric analysis of moment condition models by casting the problem within the exponentially tilted empirical likelihood (ETEL) framework. We use this framework to develop a fully Bayesian analysis of correctly specified and misspecified moment condition models. We show that even under misspecification, the Bayesian ETEL posterior distribution satisfies the Bernstein–von Mises (BvM) theorem. We also develop a unified approach based on marginal likelihoods and Bayes factors for comparing different moment-restricted models and for discarding any misspecified moment restrictions. Computation of the marginal likelihoods follows the method of Chib (1995), as extended to Metropolis–Hastings samplers by Chib and Jeliazkov (2001). We establish the model selection consistency of the marginal likelihood and show that the marginal likelihood favors the model with the minimum number of parameters and the maximum number of valid moment restrictions. When the models are misspecified, the marginal likelihood model selection procedure selects the model that is closer to the (unknown) true data-generating process in terms of the Kullback–Leibler divergence. The ideas and results in this article broaden the theoretical underpinning and value of the Bayesian ETEL framework with many practical applications. The discussion is illuminated through several examples. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1656-1668 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1358172 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1358172 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1656-1668 Template-Type: ReDIF-Article 1.0 Author-Name: Zachary M. Thomas Author-X-Name-First: Zachary M. Author-X-Name-Last: Thomas Author-Name: Steven N. MacEachern Author-X-Name-First: Steven N. Author-X-Name-Last: MacEachern Author-Name: Mario Peruggia Author-X-Name-First: Mario Author-X-Name-Last: Peruggia Title: Reconciling Curvature and Importance Sampling Based Procedures for Summarizing Case Influence in Bayesian Models Abstract: Methods for summarizing case influence in Bayesian models take essentially two forms: (1) use common divergence measures for calculating distances between the full-data posterior and the case-deleted posterior, and (2) measure the impact of infinitesimal perturbations to the likelihood to study local case influence. Methods based on approach (1) lead naturally to considering the behavior of case-deletion importance sampling weights (the weights used to approximate samples from the case-deleted posterior using samples from the full posterior). Methods based on approach (2) lead naturally to considering the local curvature of the Kullback–Leibler divergence of the full posterior from a geometrically perturbed quasi-posterior.
By examining the connections between the two approaches, we establish a rationale for employing low-dimensional summaries of case influence obtained entirely via the variance–covariance matrix of the log importance sampling weights. We illustrate the use of the proposed diagnostics using real and simulated data. Supplementary materials are available online. Journal: Journal of the American Statistical Association Pages: 1669-1683 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1360777 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1360777 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1669-1683 Template-Type: ReDIF-Article 1.0 Author-Name: Veronika Ročková Author-X-Name-First: Veronika Author-X-Name-Last: Ročková Title: Particle EM for Variable Selection Abstract: Despite its long history of success, the EM algorithm has been vulnerable to local entrapment when the posterior/likelihood is multi-modal. This is particularly pronounced in spike-and-slab posterior distributions for Bayesian variable selection. The main thrust of this article is to introduce the particle EM algorithm, a new population-based optimization strategy that harvests multiple modes in search spaces that present many local maxima. Motivated by nonparametric variational Bayes strategies, particle EM achieves this goal by deploying an ensemble of interactive repulsive particles. These particles are geared toward uncharted areas of the posterior, providing a more comprehensive summary of its topography than simple parallel EM deployments. A sequential Monte Carlo variant of particle EM is also proposed that explores a sequence of annealed posteriors by sampling from a set of mutually avoiding particles. Particle EM outputs a deterministic reconstruction of the posterior distribution for approximate fully Bayes inference by capturing its essential modes and mode weights. This reconstruction reflects model selection uncertainty and is supported by asymptotic considerations, which indicate that the requisite number of particles need not be large in the presence of sparsity (when p > n). Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1684-1697 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1360778 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1360778 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1684-1697 Template-Type: ReDIF-Article 1.0 Author-Name: Chengchun Shi Author-X-Name-First: Chengchun Author-X-Name-Last: Shi Author-Name: Wenbin Lu Author-X-Name-First: Wenbin Author-X-Name-Last: Lu Author-Name: Rui Song Author-X-Name-First: Rui Author-X-Name-Last: Song Title: A Massive Data Framework for M-Estimators with Cubic-Rate Abstract: The divide and conquer method is a common strategy for handling massive data. In this article, we study the divide and conquer method for cubic-rate estimators under the massive data framework. We develop a general theory for establishing the asymptotic distribution of the aggregated M-estimators using a weighted average with weights depending on the subgroup sample sizes. 
Under certain conditions on the growth rate of the number of subgroups, the resulting aggregated estimators are shown to have a faster convergence rate and an asymptotically normal distribution, which are more tractable in both computation and inference than the original M-estimators based on pooled data. Our theory applies to a wide class of M-estimators with cube root convergence rate, including the location estimator, maximum score estimator, and value search estimator. Empirical performance via simulations and a real data application also validates our theoretical findings. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1698-1709 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1360779 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1360779 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1698-1709 Template-Type: ReDIF-Article 1.0 Author-Name: Lorin Crawford Author-X-Name-First: Lorin Author-X-Name-Last: Crawford Author-Name: Kris C. Wood Author-X-Name-First: Kris C. Author-X-Name-Last: Wood Author-Name: Xiang Zhou Author-X-Name-First: Xiang Author-X-Name-Last: Zhou Author-Name: Sayan Mukherjee Author-X-Name-First: Sayan Author-X-Name-Last: Mukherjee Title: Bayesian Approximate Kernel Regression With Variable Selection Abstract: Nonlinear kernel regression models are often used in statistics and machine learning because they are more accurate than linear models. Variable selection for kernel regression models is a challenge partly because, unlike the linear regression setting, there is no clear concept of an effect size for regression coefficients. In this article, we propose a novel framework that provides an effect size analog for each explanatory variable in Bayesian kernel regression models when the kernel is shift-invariant—for example, the Gaussian kernel. We use function analytic properties of shift-invariant reproducing kernel Hilbert spaces (RKHS) to define a linear vector space that (i) captures nonlinear structure and (ii) can be projected onto the original explanatory variables. This projection onto the original explanatory variables serves as an analog of effect sizes. The specific function analytic property we use is that shift-invariant kernel functions can be approximated via random Fourier bases. Based on the random Fourier expansion, we propose a computationally efficient class of Bayesian approximate kernel regression (BAKR) models for both nonlinear regression and binary classification for which one can compute an analog of effect sizes. We illustrate the utility of BAKR by examining two important problems in statistical genetics: genomic selection (i.e., phenotypic prediction) and association mapping (i.e., inference of significant variants or loci). State-of-the-art methods for genomic selection and association mapping are based on kernel regression and linear models, respectively. BAKR is the first method that is competitive in both settings. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1710-1721 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1361830 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1361830 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1710-1721 Template-Type: ReDIF-Article 1.0 Author-Name: Jonas Harnau Author-X-Name-First: Jonas Author-X-Name-Last: Harnau Author-Name: Bent Nielsen Author-X-Name-First: Bent Author-X-Name-Last: Nielsen Title: Over-Dispersed Age-Period-Cohort Models Abstract: We consider inference and forecasting for aggregate data organized in a two-way table with age and cohort as indices, but without measures of exposure. This is modeled using a Poisson likelihood with an age-period-cohort structure for the mean while allowing for over-dispersion. We propose a repetitive structure that keeps the dimension of the table fixed while increasing the latent exposure. For this, we use a class of infinitely divisible distributions that includes a variety of compound Poisson models and Poisson mixture models. This results in asymptotic F inference and t forecast distributions. Journal: Journal of the American Statistical Association Pages: 1722-1732 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1366908 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1366908 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1722-1732 Template-Type: ReDIF-Article 1.0 Author-Name: Roger S. Zoh Author-X-Name-First: Roger S. Author-X-Name-Last: Zoh Author-Name: Abhra Sarkar Author-X-Name-First: Abhra Author-X-Name-Last: Sarkar Author-Name: Raymond J. Carroll Author-X-Name-First: Raymond J. Author-X-Name-Last: Carroll Author-Name: Bani K. Mallick Author-X-Name-First: Bani K. Author-X-Name-Last: Mallick Title: A Powerful Bayesian Test for Equality of Means in High Dimensions Abstract: We develop a Bayes factor-based testing procedure for comparing two population means in high-dimensional settings. In “large-p-small-n” settings, Bayes factors based on proper priors require eliciting a large and complex p × p covariance matrix, whereas Bayes factors based on the Jeffreys prior suffer the same impediment as the classical Hotelling T² test statistic as they involve inversion of ill-formed sample covariance matrices. To circumvent this limitation, we propose that the Bayes factor be based on lower-dimensional random projections of the high-dimensional data vectors. We choose the prior under the alternative to maximize the power of the test for a fixed threshold level, yielding a restricted most powerful Bayesian test (RMPBT). The final test statistic is based on the ensemble of Bayes factors corresponding to multiple replications of randomly projected data. We show that the test is unbiased and, under mild conditions, is also locally consistent. We demonstrate the efficacy of the approach through simulated and real data examples. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1733-1741 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1371024 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1371024 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1733-1741 Template-Type: ReDIF-Article 1.0 Author-Name: David Rossell Author-X-Name-First: David Author-X-Name-Last: Rossell Author-Name: Francisco J. Rubio Author-X-Name-First: Francisco J.
Author-X-Name-Last: Rubio Title: Tractable Bayesian Variable Selection: Beyond Normality Abstract: Bayesian variable selection often assumes normality, but the effects of model misspecification are not sufficiently understood. There are sound reasons behind this assumption, particularly for large p: ease of interpretation, and analytical and computational convenience. More flexible frameworks exist, including semi- or nonparametric models, often at the cost of some tractability. We propose a simple extension that allows for skewness and thicker-than-normal tails but preserves tractability. It leads to easy interpretation and a log-concave likelihood that facilitates optimization and integration. We asymptotically characterize parameter estimation and Bayes factor rates under certain forms of model misspecification. Under suitable conditions, misspecified Bayes factors induce sparsity at the same rates as under the correct model. However, the rates to detect signal change by an exponential factor, often reducing sensitivity. These deficiencies can be ameliorated by inferring the error distribution, a simple strategy that can improve inference substantially. Our work focuses on the likelihood and can be combined with any likelihood penalty or prior, but here we use nonlocal priors to induce extra sparsity and ameliorate finite-sample effects caused by misspecification. We show the importance of considering the likelihood rather than solely the prior for Bayesian variable selection. The methodology is in R package ‘mombf.’ Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1742-1758 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1371025 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1371025 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1742-1758 Template-Type: ReDIF-Article 1.0 Author-Name: Francis K. C. Hui Author-X-Name-First: Francis K. C. Author-X-Name-Last: Hui Author-Name: Samuel Müller Author-X-Name-First: Samuel Author-X-Name-Last: Müller Author-Name: A. H. Welsh Author-X-Name-First: A. H. Author-X-Name-Last: Welsh Title: Sparse Pairwise Likelihood Estimation for Multivariate Longitudinal Mixed Models Abstract: It is becoming increasingly common in longitudinal studies to collect and analyze data on multiple responses. For example, in the social sciences we may be interested in uncovering the factors driving mental health of individuals over time, where mental health is measured using a set of questionnaire items. One approach to analyzing such multi-dimensional data is multivariate mixed models, an extension of the standard univariate mixed model to handle multiple responses. Estimating multivariate mixed models presents a considerable challenge, however, let alone performing variable selection to uncover which covariates are important in driving each response. Motivated by composite likelihood ideas, we propose a new approach for estimation and fixed effects selection in multivariate mixed models, called approximate pairwise likelihood estimation and shrinkage (APLES). The method works by constructing a quadratic approximation to each term in the pairwise likelihood function, and then augmenting this approximate pairwise likelihood with a penalty that encourages both individual and group coefficient sparsity.
This leads to a relatively fast method of selection, as we can use coordinate-ascent-type methods to construct the full regularization path for the model. Our method is the first to extend penalized likelihood estimation to multivariate generalized linear mixed models. We show that the APLES estimator attains a composite likelihood version of the oracle property. We propose a new information criterion for selecting the tuning parameter, which employs a dynamic model complexity penalty to facilitate aggressive shrinkage, and demonstrate that it asymptotically leads to selection consistency, that is, to the true model being selected. A simulation study demonstrates that the APLES estimator outperforms several univariate selection methods based on analyzing each outcome separately. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1759-1769 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1371026 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1371026 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1759-1769 Template-Type: ReDIF-Article 1.0 Author-Name: Denis Rybin Author-X-Name-First: Denis Author-X-Name-Last: Rybin Author-Name: Robert Lew Author-X-Name-First: Robert Author-X-Name-Last: Lew Author-Name: Michael J. Pencina Author-X-Name-First: Michael J. Author-X-Name-Last: Pencina Author-Name: Maurizio Fava Author-X-Name-First: Maurizio Author-X-Name-Last: Fava Author-Name: Gheorghe Doros Author-X-Name-First: Gheorghe Author-X-Name-Last: Doros Title: Placebo Response as a Latent Characteristic: Application to Analysis of Sequential Parallel Comparison Design Studies Abstract: In clinical trials, placebo response can affect the inference about efficacy of the studied treatment. It is important to have a robust way to classify trial subjects with respect to their response to placebo. Simple, criterion-based classification may lead to classification error and bias the inference. The uncertainty about the placebo response characteristic has to be factored into the treatment effect estimation. We propose a novel approach that views the placebo response as a latent characteristic and the study sample as an unlabeled mixture of “placebo responders” and “placebo nonresponders.” The likelihood-based methodology is used to estimate the treatment effect corrected for placebo response as defined within the sequential parallel comparison design. Journal: Journal of the American Statistical Association Pages: 1411-1430 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1375930 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1375930 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1411-1430 Template-Type: ReDIF-Article 1.0 Author-Name: Ruth Heller Author-X-Name-First: Ruth Author-X-Name-Last: Heller Author-Name: Nilanjan Chatterjee Author-X-Name-First: Nilanjan Author-X-Name-Last: Chatterjee Author-Name: Abba Krieger Author-X-Name-First: Abba Author-X-Name-Last: Krieger Author-Name: Jianxin Shi Author-X-Name-First: Jianxin Author-X-Name-Last: Shi Title: Post-Selection Inference Following Aggregate Level Hypothesis Testing in Large-Scale Genomic Data Abstract: In many genomic applications, hypothesis tests are performed for powerful identification of signals by aggregating test-statistics across units within naturally defined classes. Following class-level testing, it is naturally of interest to identify the lower-level units that contain true signals. Testing the individual units within a class without taking into account the fact that the class was selected using an aggregate-level test-statistic will produce biased inference. We develop a hypothesis testing framework that guarantees control of false positive rates conditional on the fact that the class was selected. Specifically, we develop procedures for calculating unit-level p-values that allow rejection of null hypotheses while controlling two types of conditional error rates, one relating to the familywise error rate and the other to the false discovery rate. We use simulation studies to illustrate the validity and power of the proposed procedure in comparison to several possible alternatives. We illustrate the power of the method in a natural application involving whole-genome expression quantitative trait loci (eQTL) analysis across 17 tissue types using data from The Cancer Genome Atlas (TCGA) Project. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1770-1783 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1375933 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1375933 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1770-1783 Template-Type: ReDIF-Article 1.0 Author-Name: Federico A. Bugni Author-X-Name-First: Federico A. Author-X-Name-Last: Bugni Author-Name: Ivan A. Canay Author-X-Name-First: Ivan A. Author-X-Name-Last: Canay Author-Name: Azeem M. Shaikh Author-X-Name-First: Azeem M. Author-X-Name-Last: Shaikh Title: Inference Under Covariate-Adaptive Randomization Abstract: This article studies inference for the average treatment effect in randomized controlled trials with covariate-adaptive randomization. Here, by covariate-adaptive randomization, we mean randomization schemes that first stratify according to baseline covariates and then assign treatment status so as to achieve “balance” within each stratum. Our main requirement is that the randomization scheme assigns treatment status within each stratum so that the fraction of units being assigned to treatment within each stratum has a well-behaved distribution centered around a proportion π as the sample size tends to infinity. Such schemes include, for example, Efron’s biased-coin design and stratified block randomization.
When testing the null hypothesis that the average treatment effect equals a prespecified value in such settings, we first show that the usual two-sample t-test is conservative in the sense that its limiting rejection probability under the null hypothesis is no greater than, and typically strictly less than, the nominal level. We show, however, that a simple adjustment to the usual standard error of the two-sample t-test leads to a test that is exact in the sense that its limiting rejection probability under the null hypothesis equals the nominal level. Next, we consider the usual t-test (on the coefficient on treatment assignment) in a linear regression of outcomes on treatment assignment and indicators for each of the strata. We show that this test is exact for the important special case of randomization schemes with $\pi = \frac{1}{2}$, but is otherwise conservative. We again provide a simple adjustment to the standard errors that yields an exact test more generally. Finally, we study the behavior of a modified version of a permutation test, which we refer to as the covariate-adaptive permutation test, that only permutes treatment status for units within the same stratum. When applied to the usual two-sample t-statistic, we show that this test is exact for randomization schemes with $\pi = \frac{1}{2}$ that additionally achieve what we refer to as “strong balance.” For randomization schemes with $\pi \neq \frac{1}{2}$, this test may have a limiting rejection probability under the null hypothesis strictly greater than the nominal level. When applied to a suitably adjusted version of the two-sample t-statistic, however, we show that this test is exact for all randomization schemes that achieve “strong balance,” including those with $\pi \neq \frac{1}{2}$. A simulation study confirms the practical relevance of our theoretical results. We conclude with recommendations for empirical practice and an empirical illustration. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1784-1796 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1375934 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1375934 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1784-1796 Template-Type: ReDIF-Article 1.0 Author-Name: Chenglong Ye Author-X-Name-First: Chenglong Author-X-Name-Last: Ye Author-Name: Yi Yang Author-X-Name-First: Yi Author-X-Name-Last: Yang Author-Name: Yuhong Yang Author-X-Name-First: Yuhong Author-X-Name-Last: Yang Title: Sparsity Oriented Importance Learning for High-Dimensional Linear Regression Abstract: Now that nonnegligible model selection uncertainty is well recognized, data analysts should no longer be satisfied with the output of a single final model from a model selection process, regardless of its sophistication. To improve reliability and reproducibility in model choice, one constructive approach is to make good use of a sound variable importance measure. Although interesting importance measures are available and increasingly used in data analysis, little theoretical justification has been provided. In this article, we propose a new variable importance measure, sparsity oriented importance learning (SOIL), for high-dimensional regression from a sparse linear modeling perspective by taking into account the variable selection uncertainty via the use of a sensible model weighting.
The SOIL method is theoretically shown to have the inclusion/exclusion property: When the model weights are properly concentrated around the true model, the SOIL importance can well separate the variables in the true model from the rest. In particular, even if the signal is weak, SOIL rarely gives variables not in the true model significantly higher importance values than those in the true model. Extensive simulations in several illustrative settings and real-data examples with guided simulations show desirable properties of the SOIL importance in contrast to other importance measures. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1797-1812 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1377080 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1377080 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1797-1812 Template-Type: ReDIF-Article 1.0 Author-Name: Yacouba Boubacar Maïnassara Author-X-Name-First: Yacouba Author-X-Name-Last: Boubacar Maïnassara Author-Name: Bruno Saussereau Author-X-Name-First: Bruno Author-X-Name-Last: Saussereau Title: Diagnostic Checking in Multivariate ARMA Models With Dependent Errors Using Normalized Residual Autocorrelations Abstract: In this paper, we derive the asymptotic distribution of normalized residual empirical autocovariances and autocorrelations under weak assumptions on the noise. We propose new portmanteau statistics for vector autoregressive moving average models with uncorrelated but nonindependent innovations by using a self-normalization approach. We establish the asymptotic distribution of the proposed statistics. This asymptotic distribution is quite different from the usual chi-squared approximation used under the independent and identically distributed assumption on the noise, or the weighted sum of independent chi-squared random variables obtained under nonindependent innovations. A set of Monte Carlo experiments and an application to the daily returns of the CAC40 are presented. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1813-1827 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1380030 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1380030 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1813-1827 Template-Type: ReDIF-Article 1.0 Author-Name: Albert E. Parker Author-X-Name-First: Albert E. Author-X-Name-Last: Parker Author-Name: Betsey Pitts Author-X-Name-First: Betsey Author-X-Name-Last: Pitts Author-Name: Lindsey Lorenz Author-X-Name-First: Lindsey Author-X-Name-Last: Lorenz Author-Name: Philip S. Stewart Author-X-Name-First: Philip S. Author-X-Name-Last: Stewart Title: Polynomial Accelerated Solutions to a Large Gaussian Model for Imaging Biofilms: In Theory and Finite Precision Abstract: Three-dimensional confocal scanning laser microscope images offer dramatic visualizations of living biofilms before and after interventions. Here, we use confocal microscopy to study the effect of a treatment over time that causes a biofilm to swell and contract due to osmotic pressure changes.
From these data (the video is provided in the supplementary materials), our goal is to reconstruct biofilm surfaces, to estimate the effect of the treatment on the biofilm’s volume, and to quantify the related uncertainties. We formulate the associated massive linear Bayesian inverse problem and then solve it using iterative samplers from large multivariate Gaussians that exploit well-established polynomial acceleration techniques from numerical linear algebra. Because of a general equivalence with linear solvers, these polynomial accelerated iterative samplers have known convergence rates and stopping criteria, and perform well in finite precision. An explicit algorithm is provided, for the first time, for an iterative sampler that is accelerated by the synergistic implementation of preconditioned conjugate gradient and Chebyshev polynomials. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1431-1442 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1409121 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1409121 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1431-1442 Template-Type: ReDIF-Article 1.0 Author-Name: Simon Mak Author-X-Name-First: Simon Author-X-Name-Last: Mak Author-Name: Chih-Li Sung Author-X-Name-First: Chih-Li Author-X-Name-Last: Sung Author-Name: Xingjian Wang Author-X-Name-First: Xingjian Author-X-Name-Last: Wang Author-Name: Shiang-Ting Yeh Author-X-Name-First: Shiang-Ting Author-X-Name-Last: Yeh Author-Name: Yu-Hung Chang Author-X-Name-First: Yu-Hung Author-X-Name-Last: Chang Author-Name: V. Roshan Joseph Author-X-Name-First: V. Roshan Author-X-Name-Last: Joseph Author-Name: Vigor Yang Author-X-Name-First: Vigor Author-X-Name-Last: Yang Author-Name: C. F. Jeff Wu Author-X-Name-First: C. F. Jeff Author-X-Name-Last: Wu Title: An Efficient Surrogate Model for Emulation and Physics Extraction of Large Eddy Simulations Abstract: In the quest for advanced propulsion and power-generation systems, high-fidelity simulations are too computationally expensive to survey the desired design space, and a new design methodology is needed that combines engineering physics, computer simulations, and statistical modeling. In this article, we propose a new surrogate model that provides efficient prediction and uncertainty quantification of turbulent flows in swirl injectors with varying geometries, devices commonly used in many engineering applications. The novelty of the proposed method lies in the incorporation of known physical properties of the fluid flow as simplifying assumptions for the statistical model. In view of the massive simulation data at hand, which are on the order of hundreds of gigabytes, these assumptions allow for accurate flow predictions in around an hour of computation time. In contrast, existing flow emulators that forgo such simplifications may require more computation time for training and prediction than is needed for conducting the simulation itself. Moreover, by accounting for coupling mechanisms between flow variables, the proposed model can jointly reduce prediction uncertainty and extract useful flow physics, which can then be used to guide further investigations. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association Pages: 1443-1456 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1409123 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1409123 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1443-1456 Template-Type: ReDIF-Article 1.0 Author-Name: Bhuvanesh Pareek Author-X-Name-First: Bhuvanesh Author-X-Name-Last: Pareek Author-Name: Pulak Ghosh Author-X-Name-First: Pulak Author-X-Name-Last: Ghosh Author-Name: Hugh N. Wilson Author-X-Name-First: Hugh N. Author-X-Name-Last: Wilson Author-Name: Emma K. Macdonald Author-X-Name-First: Emma K. Author-X-Name-Last: Macdonald Author-Name: Paul Baines Author-X-Name-First: Paul Author-X-Name-Last: Baines Title: Tracking the Impact of Media on Voter Choice in Real Time: A Bayesian Dynamic Joint Model Abstract: Commonly used methods of evaluating the impact of marketing communications during political elections struggle to account for respondents’ exposures to these communications due to the problems associated with recall bias. In addition, they completely fail to account for the impact of mediated or earned communications, such as newspaper articles or television news, that are typically not within the control of the advertising party, nor are they effectively able to monitor consumers’ perceptual responses over time. This study, based on a new data collection technique using cell-phone text messaging (called real-time experience tracking, or RET), offers the potential to address these weaknesses. We propose an RET-based model of the impact of communications and apply it to a unique choice situation: voting behavior during the 2010 UK general election, which was dominated by three political parties. We develop a Bayesian zero-inflated dynamic multinomial choice model that enables the joint modeling of the interplay and dynamics associated with the individual voter's choice intentions over time, the actual vote, and the heterogeneity in exposure to marketing communications over time. Results reveal the differential impact over time of paid and earned media, demonstrate a synergy between the two, and show the particular importance of exposure valence and not just frequency, contrary to the predominant practitioner emphasis on share-of-voice metrics. Results also suggest that while earned media have a diminishing impact on voting intentions as the final choice approaches, their valence continues to influence the final vote: a difference between drivers of intentions and behavior that implies that exposure valence remains critically important close to the final brand choice. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1457-1475 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1419134 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1419134 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1457-1475 Template-Type: ReDIF-Article 1.0 Author-Name: Xueying Tang Author-X-Name-First: Xueying Author-X-Name-Last: Tang Author-Name: Malay Ghosh Author-X-Name-First: Malay Author-X-Name-Last: Ghosh Author-Name: Neung Soo Ha Author-X-Name-First: Neung Soo Author-X-Name-Last: Ha Author-Name: Joseph Sedransk Author-X-Name-First: Joseph Author-X-Name-Last: Sedransk Title: Modeling Random Effects Using Global–Local Shrinkage Priors in Small Area Estimation Abstract: Small area estimation is becoming increasingly popular among survey statisticians. One very important program is Small Area Income and Poverty Estimation undertaken by the United States Bureau of the Census, which aims at providing estimates related to income and poverty based on American Community Survey data at the state level and even at lower levels of geography. This article introduces global–local (GL) shrinkage priors for random effects in small area estimation to capture wide area-level variation when the number of small areas is very large. These priors employ two levels of parameters, global and local parameters, to express variances of area-specific random effects so that both small and large random effects can be captured properly. We show via simulations and data analysis that use of the GL priors can improve estimation results in most cases. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1476-1489 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2017.1419135 File-URL: http://hdl.handle.net/10.1080/01621459.2017.1419135 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1476-1489 Template-Type: ReDIF-Article 1.0 Author-Name: Alexander D. Bolton Author-X-Name-First: Alexander D. Author-X-Name-Last: Bolton Author-Name: Nicholas A. Heard Author-X-Name-First: Nicholas A. Author-X-Name-Last: Heard Title: Malware Family Discovery Using Reversible Jump MCMC Sampling of Regimes Abstract: Malware is computer software that has either been designed or modified with malicious intent. Hundreds of thousands of new malware threats appear on the internet each day. This is made possible through reuse of known exploits in computer systems that have not been fully eradicated; existing pieces of malware can be trivially modified and combined to create new malware, which is unknown to anti-virus programs. Finding new software with similarities to known malware is therefore an important goal in cyber-security. A dynamic instruction trace of a piece of software is the sequence of machine language instructions it generates when executed. Statistical analysis of a dynamic instruction trace can help reverse engineers infer the purpose and origin of the software that generated it. Instruction traces have been successfully modeled as simple Markov chains, but empirically there are change points in the structure of the traces, with recurring regimes of transition patterns. Here, reversible jump Markov chain Monte Carlo for change point detection is extended to incorporate regime-switching, allowing regimes to be inferred from malware instruction traces. A similarity measure for malware programs based on regime matching is then used to infer the originating families, leading to compelling performance results.
Journal: Journal of the American Statistical Association Pages: 1490-1502 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2018.1423984 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1423984 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1490-1502 Template-Type: ReDIF-Article 1.0 Author-Name: Gen Li Author-X-Name-First: Gen Author-X-Name-Last: Li Author-Name: Jianhua Z. Huang Author-X-Name-First: Jianhua Z. Author-X-Name-Last: Huang Author-Name: Haipeng Shen Author-X-Name-First: Haipeng Author-X-Name-Last: Shen Title: To Wait or Not to Wait: Two-Way Functional Hazards Model for Understanding Waiting in Call Centers Abstract: Telephone call centers offer a convenient communication channel between businesses and their customers. Efficient management of call centers needs accurate modeling of customer waiting behavior, which contains important information about customer patience (how long a customer is willing to wait) and service quality (how long a customer needs to wait to get served). Hazard functions offer dynamic characterization of customer waiting behavior, and provide critical inputs for agent scheduling. Motivated by this application, we develop a two-way functional hazards (tF-Hazards) model to study customer waiting behavior as a function of two timescales, waiting duration and the time of day that a customer calls in. The model stems from a two-way piecewise constant hazard function, and imposes low-rank structure and smoothness on the hazard rates to enhance interpretability. We exploit an alternating direction method of multipliers algorithm to optimize a penalized likelihood function of the model. We carefully analyze the data from a U.S. Bank call center, and provide informative insights about customer patience and service quality patterns along waiting time and across different times of day. The findings provide primitive inputs for call center agent staffing and scheduling, as well as for call center practitioners to understand the effect of system protocols on customer waiting behavior. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1503-1514 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2018.1423985 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1423985 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1503-1514 Template-Type: ReDIF-Article 1.0 Author-Name: Abhra Sarkar Author-X-Name-First: Abhra Author-X-Name-Last: Sarkar Author-Name: Jonathan Chabout Author-X-Name-First: Jonathan Author-X-Name-Last: Chabout Author-Name: Joshua Jones Macopson Author-X-Name-First: Joshua Jones Author-X-Name-Last: Macopson Author-Name: Erich D. Jarvis Author-X-Name-First: Erich D. Author-X-Name-Last: Jarvis Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: Bayesian Semiparametric Mixed Effects Markov Models With Application to Vocalization Syntax Abstract: Studying the neurological, genetic, and evolutionary basis of human vocal communication mechanisms using animal vocalization models is an important field of neuroscience. The datasets typically comprise structured sequences of syllables or “songs” produced by animals from different genotypes under different social contexts.
It has been difficult to come up with sophisticated statistical methods that appropriately model animal vocal communication syntax. We address this need by developing a novel Bayesian semiparametric framework for inference in such datasets. Our approach is built on a novel class of mixed effects Markov transition models for the songs that accommodate exogenous influences of genotype and context as well as animal-specific heterogeneity. Crucial advantages of the proposed approach include its ability to provide insights into key scientific queries related to global and local influences of the exogenous predictors on the transition dynamics via automated tests of hypotheses. The methodology is illustrated using simulation experiments and the aforementioned motivating application in neuroscience. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1515-1527 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2018.1423986 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1423986 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1515-1527 Template-Type: ReDIF-Article 1.0 Author-Name: Yingbo Li Author-X-Name-First: Yingbo Author-X-Name-Last: Li Author-Name: Merlise A. Clyde Author-X-Name-First: Merlise A. Author-X-Name-Last: Clyde Title: Mixtures of g-Priors in Generalized Linear Models Abstract: Mixtures of Zellner’s g-priors have been studied extensively in linear models and have been shown to have numerous desirable properties for Bayesian variable selection and model averaging. Several extensions of g-priors to generalized linear models (GLMs) have been proposed in the literature; however, the choice of prior distribution of g and the resulting properties for inference have received considerably less attention. In this article, we unify mixtures of g-priors in GLMs by assigning the truncated Compound Confluent Hypergeometric (tCCH) distribution to 1/(1 + g), which encompasses as special cases several mixtures of g-priors in the literature, such as the hyper-g, Beta-prime, truncated Gamma, incomplete inverse-Gamma, benchmark, robust, hyper-g/n, and intrinsic priors. Through an integrated Laplace approximation, the posterior distribution of 1/(1 + g) is in turn a tCCH distribution, and approximate marginal likelihoods are thus available analytically, leading to “Compound Hypergeometric Information Criteria” for model selection. We discuss the local geometric properties of the g-prior in GLMs and show how the desiderata for model selection proposed by Bayarri et al., such as asymptotic model selection consistency, intrinsic consistency, and measurement invariance, may be used to justify the prior and specific choices of the hyperparameters. We illustrate inference using these priors and contrast them to other approaches via simulation and real data examples. The methodology is implemented in the R package BAS and is freely available on CRAN. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1828-1845 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2018.1469992 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1469992 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1828-1845 Template-Type: ReDIF-Article 1.0 Author-Name: Cheng-Han Yu Author-X-Name-First: Cheng-Han Author-X-Name-Last: Yu Author-Name: Raquel Prado Author-X-Name-First: Raquel Author-X-Name-Last: Prado Author-Name: Hernando Ombao Author-X-Name-First: Hernando Author-X-Name-Last: Ombao Author-Name: Daniel Rowe Author-X-Name-First: Daniel Author-X-Name-Last: Rowe Title: A Bayesian Variable Selection Approach Yields Improved Detection of Brain Activation From Complex-Valued fMRI Abstract: Voxel functional magnetic resonance imaging (fMRI) time courses are complex-valued signals giving rise to magnitude and phase data. Nevertheless, most studies use only the magnitude signals and thus discard half of the data that could potentially contain important information. Methods that make use of complex-valued fMRI (CV-fMRI) data have been shown to lead to superior power in detecting active voxels when compared to magnitude-only methods, particularly for small signal-to-noise ratios (SNRs). We present a new Bayesian variable selection approach for detecting brain activation at the voxel level from CV-fMRI data. We develop models with complex-valued spike-and-slab priors on the activation parameters that are able to combine the magnitude and phase information. We present a complex-valued EM variable selection algorithm that leads to fast detection at the voxel level in CV-fMRI slices and also consider full posterior inference via Markov chain Monte Carlo (MCMC). Model performance is illustrated through extensive simulation studies, including the analysis of physically based simulated CV-fMRI slices. Finally, we use the complex-valued Bayesian approach to detect active voxels in human CV-fMRI from a healthy individual who performed unilateral finger tapping in a designed experiment. The proposed approach leads to improved detection of activation in the expected motor-related brain regions and produces fewer false positive results than other methods for CV-fMRI. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1395-1410 Issue: 524 Volume: 113 Year: 2018 Month: 10 X-DOI: 10.1080/01621459.2018.1476244 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1476244 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1395-1410 Template-Type: ReDIF-Article 1.0 Author-Name: Daniela Castro-Camilo Author-X-Name-First: Daniela Author-X-Name-Last: Castro-Camilo Author-Name: Raphaël Huser Author-X-Name-First: Raphaël Author-X-Name-Last: Huser Title: Local Likelihood Estimation of Complex Tail Dependence Structures, Applied to U.S. Precipitation Extremes Abstract: To disentangle the complex nonstationary dependence structure of precipitation extremes over the entire contiguous United States (U.S.), we propose a flexible local approach based on factor copula models. Our subasymptotic spatial modeling framework yields nontrivial tail dependence structures, with a weakening dependence strength as events become more extreme, a feature commonly observed with precipitation data but not accounted for in classical asymptotic extreme-value models. To estimate the local extremal behavior, we fit the proposed model in small regional neighborhoods to high threshold exceedances, under the assumption of local stationarity, which allows us to gain flexibility.
By adopting a local censored likelihood approach, we make inference on a fine spatial grid, and we perform local estimation by taking advantage of distributed computing resources and the embarrassingly parallel nature of this estimation procedure. The local model is efficiently fitted at all grid points, and uncertainty is measured using a block bootstrap procedure. We carry out an extensive simulation study to show that our approach can adequately capture complex, nonstationary dependencies. In addition, our study of U.S. winter precipitation data reveals interesting differences in local tail structures over space, which have important implications for regional risk assessment of extreme precipitation events. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1037-1054 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1647842 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1647842 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1037-1054 Template-Type: ReDIF-Article 1.0 Author-Name: Douglas R. Wilson Author-X-Name-First: Douglas R. Author-X-Name-Last: Wilson Author-Name: Chong Jin Author-X-Name-First: Chong Author-X-Name-Last: Jin Author-Name: Joseph G. Ibrahim Author-X-Name-First: Joseph G. Author-X-Name-Last: Ibrahim Author-Name: Wei Sun Author-X-Name-First: Wei Author-X-Name-Last: Sun Title: ICeD-T Provides Accurate Estimates of Immune Cell Abundance in Tumor Samples by Allowing for Aberrant Gene Expression Patterns Abstract: Immunotherapies have attracted considerable research interest recently. The need to understand the underlying mechanisms of immunotherapies and to develop precision immunotherapy regimens has spurred great interest in characterizing immune cell composition within the tumor microenvironment. Several methods have been developed to estimate immune cell composition using gene expression data from bulk tumor samples. However, these methods are not flexible enough to handle aberrant patterns of gene expression data, for example, inconsistent cell type-specific gene expression between purified reference samples and tumor samples. We propose a novel statistical method for expression deconvolution called immune cell deconvolution in tumor tissues (ICeD-T). ICeD-T automatically identifies aberrant genes whose expression is inconsistent with the deconvolution model and down-weights their contributions to cell type abundance estimates. We evaluated the performance of ICeD-T versus existing methods in simulation studies and several real data analyses. ICeD-T displayed comparable or superior performance to these competing methods. When these methods are applied to assess the relationship between immunotherapy response and immune cell composition, ICeD-T is able to identify significant associations that are missed by its competitors. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association Pages: 1055-1065 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1654874 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1654874 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1055-1065 Template-Type: ReDIF-Article 1.0 Author-Name: Qian Guan Author-X-Name-First: Qian Author-X-Name-Last: Guan Author-Name: Brian J. Reich Author-X-Name-First: Brian J. Author-X-Name-Last: Reich Author-Name: Eric B. Laber Author-X-Name-First: Eric B. Author-X-Name-Last: Laber Author-Name: Dipankar Bandyopadhyay Author-X-Name-First: Dipankar Author-X-Name-Last: Bandyopadhyay Title: Bayesian Nonparametric Policy Search With Application to Periodontal Recall Intervals Abstract: Tooth loss from periodontal disease is a major public health burden in the United States. Standard clinical practice is to recommend a dental visit every six months; however, this practice is not evidence-based, and poor dental outcomes and increasing dental insurance premiums indicate room for improvement. We consider a tailored approach that recommends recall time based on patient characteristics and medical history to minimize disease progression without increasing resource expenditures. We formalize this method as a dynamic treatment regime, which comprises a sequence of decisions, one per stage of intervention, that follow a decision rule which maps current patient information to a recommendation for their next visit time. The dynamics of periodontal health, visit frequency, and patient compliance are complex, yet the estimated optimal regime must be interpretable to domain experts if it is to be integrated into clinical practice. We combine nonparametric Bayesian dynamics modeling with policy-search algorithms to estimate the optimal dynamic treatment regime within an interpretable class of regimes. Both simulation experiments and application to a rich database of electronic dental records from the HealthPartners HMO show that our proposed method leads to better dental health without increasing the average recommended recall time relative to competing methods. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1066-1078 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1660169 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1660169 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1066-1078 Template-Type: ReDIF-Article 1.0 Author-Name: Ryan Sun Author-X-Name-First: Ryan Author-X-Name-Last: Sun Author-Name: Xihong Lin Author-X-Name-First: Xihong Author-X-Name-Last: Lin Title: Genetic Variant Set-Based Tests Using the Generalized Berk–Jones Statistic With Application to a Genome-Wide Association Study of Breast Cancer Abstract: Studying the effects of groups of single nucleotide polymorphisms (SNPs), as in a gene, genetic pathway, or network, can provide novel insight into complex diseases such as breast cancer, uncovering new genetic associations and augmenting the information that can be gleaned from studying SNPs individually.
Common challenges in set-based genetic association testing include weak effect sizes, correlation between SNPs in a SNP-set, and scarcity of signals, with the number of SNPs carrying nonzero effects often ranging from extremely sparse to moderately sparse. Motivated by these challenges, we propose the Generalized Berk–Jones (GBJ) test for the association between a SNP-set and outcome. The GBJ extends the Berk–Jones statistic by accounting for correlation among SNPs, and it provides advantages over the Generalized Higher Criticism test when signals in a SNP-set are moderately sparse. We also provide an analytic p-value calculation for SNP-sets of any finite size, and we develop an omnibus statistic that is robust to the degree of signal sparsity. An additional advantage of our work is the ability to conduct inference using individual SNP summary statistics from a genome-wide association study (GWAS). We evaluate the finite sample performance of the GBJ through simulation and apply the method to identify breast cancer risk genes in a GWAS conducted by the Cancer Genetic Markers of Susceptibility Consortium. Our results suggest evidence of association between FGFR2 and breast cancer and also identify other potential susceptibility genes, complementing conventional SNP-level analysis. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1079-1091 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1660170 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1660170 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1079-1091 Template-Type: ReDIF-Article 1.0 Author-Name: Kenichiro McAlinn Author-X-Name-First: Kenichiro Author-X-Name-Last: McAlinn Author-Name: Knut Are Aastveit Author-X-Name-First: Knut Are Author-X-Name-Last: Aastveit Author-Name: Jouchi Nakajima Author-X-Name-First: Jouchi Author-X-Name-Last: Nakajima Author-Name: Mike West Author-X-Name-First: Mike Author-X-Name-Last: West Title: Multivariate Bayesian Predictive Synthesis in Macroeconomic Forecasting Abstract: We present new methodology and a case study in the use of a class of Bayesian predictive synthesis (BPS) models for multivariate time series forecasting. This extends the foundational BPS framework to the multivariate setting, with detailed application in the topical and challenging context of multistep macroeconomic forecasting in a monetary policy setting. BPS evaluates—sequentially and adaptively over time—varying forecast biases and facets of miscalibration of individual forecast densities for multiple time series, and—critically—their time-varying inter-dependencies. We define BPS methodology for a new class of dynamic multivariate latent factor models implied by BPS theory. Structured dynamic latent factor BPS is here motivated by the application context—sequential forecasting of multiple U.S. macroeconomic time series with forecasts generated from several traditional econometric time series models.
The case study highlights the potential of BPS to improve forecasts of multiple series at multiple forecast horizons, and its use in learning dynamic relationships among forecasting models or agents. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1092-1110 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1660171 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1660171 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1092-1110 Template-Type: ReDIF-Article 1.0 Author-Name: Yawen Guan Author-X-Name-First: Yawen Author-X-Name-Last: Guan Author-Name: Margaret C. Johnson Author-X-Name-First: Margaret C. Author-X-Name-Last: Johnson Author-Name: Matthias Katzfuss Author-X-Name-First: Matthias Author-X-Name-Last: Katzfuss Author-Name: Elizabeth Mannshardt Author-X-Name-First: Elizabeth Author-X-Name-Last: Mannshardt Author-Name: Kyle P. Messier Author-X-Name-First: Kyle P. Author-X-Name-Last: Messier Author-Name: Brian J. Reich Author-X-Name-First: Brian J. Author-X-Name-Last: Reich Author-Name: Joon J. Song Author-X-Name-First: Joon J. Author-X-Name-Last: Song Title: Fine-Scale Spatiotemporal Air Pollution Analysis Using Mobile Monitors on Google Street View Vehicles Abstract: People are increasingly concerned with understanding their personal environment, including possible exposure to harmful air pollutants. To make informed decisions on their day-to-day activities, they are interested in real-time information on a localized scale. Publicly available, fine-scale, high-quality air pollution measurements acquired using mobile monitors represent a paradigm shift in measurement technologies. A methodological framework utilizing these increasingly fine-scale measurements to provide real-time air pollution maps and short-term air quality forecasts on a fine-resolution spatial scale could prove to be instrumental in increasing public awareness and understanding. The Google Street View study provides a unique source of data with spatial and temporal complexities, with the potential to provide information about commuter exposure and hot spots within city streets with high traffic. We develop a computationally efficient spatiotemporal model for these data and use the model to make short-term forecasts and high-resolution maps of current air pollution levels. We also show via an experiment that mobile networks can provide more nuanced information than an equally sized fixed-location network. This modeling framework has important real-world implications in understanding citizens’ personal environments, as data production and real-time availability continue to be driven by the ongoing development and improvement of mobile measurement technologies. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1111-1124 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1665526 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1665526 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1111-1124 Template-Type: ReDIF-Article 1.0 Author-Name: Naim U. Rashid Author-X-Name-First: Naim U. Author-X-Name-Last: Rashid Author-Name: Quefeng Li Author-X-Name-First: Quefeng Author-X-Name-Last: Li Author-Name: Jen Jen Yeh Author-X-Name-First: Jen Jen Author-X-Name-Last: Yeh Author-Name: Joseph G. Ibrahim Author-X-Name-First: Joseph G. Author-X-Name-Last: Ibrahim Title: Modeling Between-Study Heterogeneity for Improved Replicability in Gene Signature Selection and Clinical Prediction Abstract: In the genomic era, the identification of gene signatures associated with disease is of significant interest. Such signatures are often used to predict clinical outcomes in new patients and aid clinical decision-making. However, recent studies have shown that gene signatures are often not replicable. This occurrence has practical implications regarding the generalizability and clinical applicability of such signatures. To improve replicability, we introduce a novel approach that selects, from multiple datasets, gene signatures whose effects are consistently nonzero, while accounting for between-study heterogeneity. We build our model upon rank-based quantities, facilitating integration over different genomic datasets. A high-dimensional penalized generalized linear mixed model is used to select gene signatures and address data heterogeneity. We compare our method to some commonly used strategies that select gene signatures ignoring between-study heterogeneity. We provide asymptotic results justifying the performance of our method and demonstrate its advantage in the presence of heterogeneity through thorough simulation studies. Lastly, we motivate our method through a case study subtyping pancreatic cancer patients from four gene expression studies. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1125-1138 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1671197 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1671197 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1125-1138 Template-Type: ReDIF-Article 1.0 Author-Name: Lorin Crawford Author-X-Name-First: Lorin Author-X-Name-Last: Crawford Author-Name: Anthea Monod Author-X-Name-First: Anthea Author-X-Name-Last: Monod Author-Name: Andrew X. Chen Author-X-Name-First: Andrew X. Author-X-Name-Last: Chen Author-Name: Sayan Mukherjee Author-X-Name-First: Sayan Author-X-Name-Last: Mukherjee Author-Name: Raúl Rabadán Author-X-Name-First: Raúl Author-X-Name-Last: Rabadán Title: Predicting Clinical Outcomes in Glioblastoma: An Application of Topological and Functional Data Analysis Abstract: Glioblastoma multiforme (GBM) is an aggressive form of human brain cancer that is under active study in the field of cancer biology. Its rapid progression and the relative time cost of obtaining molecular data make other readily available forms of data, such as images, an important resource for actionable measures in patients. Our goal is to use information given by medical images taken from GBM patients in statistical settings. To do this, we design a novel statistic—the smooth Euler characteristic transform (SECT)—that quantifies magnetic resonance images of tumors.
Due to its well-defined inner product structure, the SECT can be used in a wider range of functional and nonparametric modeling approaches than other previously proposed topological summary statistics. When applied to a cohort of GBM patients, we find that the SECT is a better predictor of clinical outcomes than both existing tumor shape quantifications and common molecular assays. Specifically, we demonstrate that SECT features alone explain more of the variance in GBM patient survival than gene expression, volumetric features, and morphometric features. The main takeaways from our findings are thus twofold. First, they suggest that images contain valuable information that can play an important role in clinical prognosis and other medical decisions. Second, they show that the SECT is a viable tool for the broader study of medical imaging informatics. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1139-1150 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1671198 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1671198 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1139-1150 Template-Type: ReDIF-Article 1.0 Author-Name: Amanda F. Mejia Author-X-Name-First: Amanda F. Author-X-Name-Last: Mejia Author-Name: Mary Beth Nebel Author-X-Name-First: Mary Beth Author-X-Name-Last: Nebel Author-Name: Yikai Wang Author-X-Name-First: Yikai Author-X-Name-Last: Wang Author-Name: Brian S. Caffo Author-X-Name-First: Brian S. Author-X-Name-Last: Caffo Author-Name: Ying Guo Author-X-Name-First: Ying Author-X-Name-Last: Guo Title: Template Independent Component Analysis: Targeted and Reliable Estimation of Subject-level Brain Networks Using Big Data Population Priors Abstract: Large brain imaging databases contain a wealth of information on brain organization in the populations they target, and on individual variability. While such databases have been used to study group-level features of populations directly, they are currently underutilized as a resource to inform single-subject analysis. Here, we propose leveraging the information contained in large functional magnetic resonance imaging (fMRI) databases by establishing population priors to employ in an empirical Bayesian framework. We focus on estimation of brain networks as source signals in independent component analysis (ICA). We formulate a hierarchical “template” ICA model where source signals—including known population brain networks and subject-specific signals—are represented as latent variables. For estimation, we derive an expectation–maximization (EM) algorithm having an explicit solution. However, as this solution is computationally intractable, we also consider an approximate subspace algorithm and a faster two-stage approach. Through extensive simulation studies, we assess performance of both methods and compare with dual regression, a popular but ad hoc method. The two proposed algorithms have similar performance, and both dramatically outperform dual regression. We also conduct a reliability study utilizing the Human Connectome Project and find that template ICA achieves substantially better performance than dual regression, achieving 75–250% higher intra-subject reliability.
Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1151-1177 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1679638 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1679638 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1151-1177 Template-Type: ReDIF-Article 1.0 Author-Name: Carles Bretó Author-X-Name-First: Carles Author-X-Name-Last: Bretó Author-Name: Edward L. Ionides Author-X-Name-First: Edward L. Author-X-Name-Last: Ionides Author-Name: Aaron A. King Author-X-Name-First: Aaron A. Author-X-Name-Last: King Title: Panel Data Analysis via Mechanistic Models Abstract: Panel data, also known as longitudinal data, consist of a collection of time series. Each time series, which could itself be multivariate, comprises a sequence of measurements taken on a distinct unit. Mechanistic modeling involves writing down scientifically motivated equations describing the collection of dynamic systems giving rise to the observations on each unit. A defining characteristic of panel systems is that the dynamic interaction between units should be negligible. Panel models therefore consist of a collection of independent stochastic processes, generally linked through shared parameters while also having unit-specific parameters. To give the scientist flexibility in model specification, we are motivated to develop a framework for inference on panel data permitting the consideration of arbitrary nonlinear, partially observed panel models. We build on iterated filtering techniques that provide likelihood-based inference on nonlinear partially observed Markov process models for time series data. Our methodology depends on the latent Markov process only through simulation; this plug-and-play property ensures applicability to a large class of models. We demonstrate our methodology on a toy example and two epidemiological case studies. We address inferential and computational issues arising due to the combination of model complexity and dataset size. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1178-1188 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1604367 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604367 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1178-1188 Template-Type: ReDIF-Article 1.0 Author-Name: Christian H. Weiß Author-X-Name-First: Christian H. Author-X-Name-Last: Weiß Title: Distance-Based Analysis of Ordinal Data and Ordinal Time Series Abstract: The dissimilarity of ordinal categories can be expressed with a distance measure. A unified approach relying on expected distances is proposed to obtain well-interpretable measures of location, dispersion, or symmetry of random variables, as well as measures of serial dependence within a given process. For special types of distance, these analytic tools lead to known approaches for ordinal or real-valued random variables. We also analyze the sample counterparts of the proposed measures and derive asymptotic results for practically important cases in ordinal data and time series analysis. 
Two real applications concerning the economic situation in Germany and the credit ratings of European countries are presented. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1189-1200 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1604370 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604370 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1189-1200 Template-Type: ReDIF-Article 1.0 Author-Name: Chengchun Shi Author-X-Name-First: Chengchun Author-X-Name-Last: Shi Author-Name: Wenbin Lu Author-X-Name-First: Wenbin Author-X-Name-Last: Lu Author-Name: Rui Song Author-X-Name-First: Rui Author-X-Name-Last: Song Title: A Sparse Random Projection-Based Test for Overall Qualitative Treatment Effects Abstract: In contrast to the classical “one-size-fits-all” approach, precision medicine proposes the customization of individualized treatment regimes to account for patients’ heterogeneity in response to treatments. Most existing works in the literature have focused on estimating optimal individualized treatment regimes. However, there has been less attention devoted to hypothesis testing regarding the existence of overall qualitative treatment effects, especially when there are a large number of prognostic covariates. When covariates do not have qualitative treatment effects, the optimal treatment regime will assign the same treatment to all patients regardless of their covariate values. In this article, we consider testing the overall qualitative treatment effects of patients’ prognostic covariates in a high-dimensional setting. We propose a sample splitting method to construct the test statistic, based on a nonparametric estimator of the contrast function. When the dimension of covariates is large, we construct the test based on sparse random projections of covariates into a low-dimensional space. We prove the consistency of our test statistic. In regular cases, we show that the power function of our test statistic is asymptotically the same as that of the “oracle” test statistic, which is constructed based on the “optimal” projection matrix. Simulation studies and real data applications validate our theoretical findings. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1201-1213 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1604368 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604368 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1201-1213 Template-Type: ReDIF-Article 1.0 Author-Name: Valentina Corradi Author-X-Name-First: Valentina Author-X-Name-Last: Corradi Author-Name: Walter Distaso Author-X-Name-First: Walter Author-X-Name-Last: Distaso Author-Name: Marcelo Fernandes Author-X-Name-First: Marcelo Author-X-Name-Last: Fernandes Title: Testing for Jump Spillovers Without Testing for Jumps Abstract: This article develops statistical tools for testing conditional independence among the jump components of the daily quadratic variation, which we estimate using intraday data. To avoid sequential bias distortion, we do not pretest for the presence of jumps. If the null is true, our test statistic based on daily integrated jumps weakly converges to a Gaussian random variable if both assets have jumps.
If instead at least one asset has no jumps, then the statistic approaches zero in probability. We show how to compute asymptotically valid bootstrap-based critical values that result in a consistent test with asymptotic size equal to or smaller than the nominal size. Empirically, we study jump linkages between US futures and equity index markets. We find not only strong evidence of jump cross-excitation between the SPDR exchange-traded fund and E-mini futures on the S&P 500 index, but also that integrated jumps in the E-mini futures during the overnight period carry relevant information. Supplementary materials for this article are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1214-1226 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1609971 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1609971 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1214-1226 Template-Type: ReDIF-Article 1.0 Author-Name: Shan Luo Author-X-Name-First: Shan Author-X-Name-Last: Luo Author-Name: Zehua Chen Author-X-Name-First: Zehua Author-X-Name-Last: Chen Title: Feature Selection by Canonical Correlation Search in High-Dimensional Multiresponse Models With Complex Group Structures Abstract: High-dimensional multiresponse models with complex group structures in both the response variables and the covariates arise from current research in important fields such as genetics and medicine. However, not enough research has been done on such models. One of the few existing studies, if not the only one, is the article by Li, Nan, and Zhu, where the sparse group Lasso approach is extended to such models. In this article, we propose a novel approach named the sequential canonical correlation search (SCCS) procedure. In the SCCS procedure, the nonzero group-by-group blocks of regression coefficients are searched stepwise using a canonical correlation measure. Each step of the procedure consists of a block selection and a sparsity identification. The model selection criterion, EBIC, is used as the stopping rule of the procedure. We establish the selection consistency of the SCCS procedure and conduct simulation studies comparing it with existing methods. The SCCS procedure has two advantages over the sparse group Lasso method: (i) it is more accurate in the identification of nonzero coefficient blocks and their nonzero entries, and (ii) its implementation is not limited by the dimensionality of the models and requires much less computation. A real example in genetic studies is also considered. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1227-1235 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1609972 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1609972 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1227-1235 Template-Type: ReDIF-Article 1.0 Author-Name: Yin Xia Author-X-Name-First: Yin Author-X-Name-Last: Xia Author-Name: T. Tony Cai Author-X-Name-First: T.
Tony Author-X-Name-Last: Cai Author-Name: Wenguang Sun Author-X-Name-First: Wenguang Author-X-Name-Last: Sun Title: GAP: A General Framework for Information Pooling in Two-Sample Sparse Inference Abstract: This article develops a general framework for exploiting the sparsity information in two-sample multiple testing problems. We propose to first construct a covariate sequence, in addition to the usual primary test statistics, to capture the sparsity structure, and then incorporate the auxiliary covariates in inference via a three-step algorithm consisting of grouping, adjusting, and pooling (GAP). The GAP procedure provides a simple and effective framework for information pooling. An important advantage of GAP is its capability of handling various dependence structures such as those arising from high-dimensional linear regression, differential correlation analysis, and differential network analysis. We establish general conditions under which GAP is asymptotically valid for false discovery rate control, and show that these conditions are fulfilled in a range of settings, including testing multivariate normal means, high-dimensional linear regression, differential covariance or correlation matrices, and Gaussian graphical models. Numerical results demonstrate that existing methods can be significantly improved by the proposed framework. The GAP procedure is illustrated using a breast cancer study for identifying gene–gene interactions. Journal: Journal of the American Statistical Association Pages: 1236-1250 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1611585 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1611585 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1236-1250 Template-Type: ReDIF-Article 1.0 Author-Name: Jieli Shen Author-X-Name-First: Jieli Author-X-Name-Last: Shen Author-Name: Regina Y. Liu Author-X-Name-First: Regina Y. Author-X-Name-Last: Liu Author-Name: Min-ge Xie Author-X-Name-First: Min-ge Author-X-Name-Last: Xie Title: iFusion: Individualized Fusion Learning Abstract: Inferences from different data sources can often be fused together, a process referred to as “fusion learning,” to yield more powerful findings than those from individual data sources alone. Effective fusion learning approaches are in growing demand as an increasing number of data sources become easily available in this big data era. This article proposes a new fusion learning approach, called “iFusion,” for drawing efficient individualized inference by fusing learnings from relevant data sources. Specifically, iFusion (i) summarizes inferences from individual data sources as individual confidence distributions (CDs); (ii) forms a clique of individuals that bear relevance to the target individual and then combines the CDs from those relevant individuals; and, finally, (iii) draws inference for the target individual from the combined CD. In essence, iFusion strategically “borrows strength” from relevant individuals to enhance the efficiency of the target individual inference while preserving its validity. This article focuses on the setting where each individual study has a number of observations but its inference can be further improved by incorporating additional information from similar studies, referred to as its clique. Under this setting, iFusion is shown to achieve an oracle property under suitable conditions.
It is also shown to be flexible and robust in handling heterogeneity arising from diverse data sources. The development is ideally suited for goal-directed applications. Computationally, iFusion is parallel in nature and scales up easily for big data. An efficient scalable algorithm is provided for implementation. Simulation studies and a real application in financial forecasting are presented. In effect, this article covers methodology, theory, computation, and application for individualized inference by iFusion. Journal: Journal of the American Statistical Association Pages: 1251-1267 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1672557 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1672557 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1251-1267 Template-Type: ReDIF-Article 1.0 Author-Name: Yassir Rabhi Author-X-Name-First: Yassir Author-X-Name-Last: Rabhi Author-Name: Taoufik Bouezmarni Author-X-Name-First: Taoufik Author-X-Name-Last: Bouezmarni Title: Nonparametric Inference for Copulas and Measures of Dependence Under Length-Biased Sampling and Informative Censoring Abstract: Length-biased data are often encountered in cross-sectional surveys and prevalent-cohort studies on disease durations. Under length-biased sampling, subjects with longer disease durations have a greater chance of being observed. As a result, covariate values linked to the longer survivors are favored by the sampling mechanism. When the sampled durations are also subject to right censoring, the censoring is informative. Modeling dependence structure without adjusting for these issues leads to biased results. In this article, we consider copulas for modeling dependence when the collected data are length-biased and account for both informative censoring and covariate bias that are naturally linked to length-biased sampling. We address nonparametric estimation of the bivariate distribution, copula function and its density, and Kendall and Spearman measures for right-censored length-biased data. The proposed estimator for the bivariate cdf is a Hadamard-differentiable functional of two MLEs (Kaplan–Meier and empirical cdf) and inherits their efficiency. Based on this estimator, we devise two estimators for the copula function and a local-polynomial estimator for the copula density that accounts for boundary bias. The limiting processes of the estimators are established by deriving their iid representations. As a by-product, we establish the oscillation behavior of the bivariate cdf estimator. In addition, we introduce estimators for Kendall and Spearman measures and study their weak convergence. The proposed method is applied to analyze a set of right-censored length-biased data on survival with dementia, collected as part of a nationwide study in Canada. Journal: Journal of the American Statistical Association Pages: 1268-1278 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1611586 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1611586 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1268-1278 Template-Type: ReDIF-Article 1.0 Author-Name: Yin Xia Author-X-Name-First: Yin Author-X-Name-Last: Xia Author-Name: Lexin Li Author-X-Name-First: Lexin Author-X-Name-Last: Li Author-Name: Samuel N. Lockhart Author-X-Name-First: Samuel N. Author-X-Name-Last: Lockhart Author-Name: William J.
Jagust Author-X-Name-First: William J. Author-X-Name-Last: Jagust Title: Simultaneous Covariance Inference for Multimodal Integrative Analysis Abstract: Multimodal integrative analysis fuses different types of data collected on the same set of experimental subjects. It is becoming a norm in many branches of scientific research, such as multi-omics and multimodal neuroimaging analysis. In this article, we address the problem of simultaneous covariance inference of associations between multiple modalities, which is of vital interest in multimodal integrative analysis. Recognizing that there are few readily available solutions in the literature for this type of problem, we develop a new simultaneous testing procedure. It provides an explicit quantification of statistical significance, much improved detection power, as well as rigorous false discovery control. Our proposal makes novel and useful contributions from both the scientific perspective and the statistical methodological perspective. We demonstrate the efficacy of the new method through both simulations and a multimodal positron emission tomography study of associations between two hallmark pathological proteins of Alzheimer’s disease. Journal: Journal of the American Statistical Association Pages: 1279-1291 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1623040 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1623040 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1279-1291 Template-Type: ReDIF-Article 1.0 Author-Name: Geneviève Robin Author-X-Name-First: Geneviève Author-X-Name-Last: Robin Author-Name: Olga Klopp Author-X-Name-First: Olga Author-X-Name-Last: Klopp Author-Name: Julie Josse Author-X-Name-First: Julie Author-X-Name-Last: Josse Author-Name: Éric Moulines Author-X-Name-First: Éric Author-X-Name-Last: Moulines Author-Name: Robert Tibshirani Author-X-Name-First: Robert Author-X-Name-Last: Tibshirani Title: Main Effects and Interactions in Mixed and Incomplete Data Frames Abstract: A mixed data frame (MDF) is a table collecting categorical, numerical, and count observations. The use of MDFs is widespread in statistics, and the applications are numerous, from abundance data in ecology to recommender systems. In many cases, an MDF simultaneously exhibits main effects, such as row, column, or group effects, and interactions, for which a low-rank model has often been suggested. Although the literature on low-rank approximations is very substantial, with few exceptions, existing methods do not allow one to incorporate main effects and interactions while providing statistical guarantees. The present work fills this gap. We propose an estimation method which allows one to recover the main effects and the interactions simultaneously. We show that our method is near optimal under conditions which are met in our targeted applications. We also propose an optimization algorithm which provably converges to an optimal solution. Numerical experiments reveal that our method, mimi, performs well when the main effects are sparse and the interaction matrix has low rank. We also show that mimi compares favorably to existing methods, in particular when the main effects are significantly large compared to the interactions, and when the proportion of missing entries is large. The method is available as an R package on the Comprehensive R Archive Network. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 1292-1303 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1623041 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1623041 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1292-1303 Template-Type: ReDIF-Article 1.0 Author-Name: Chunlin Li Author-X-Name-First: Chunlin Author-X-Name-Last: Li Author-Name: Xiaotong Shen Author-X-Name-First: Xiaotong Author-X-Name-Last: Shen Author-Name: Wei Pan Author-X-Name-First: Wei Author-X-Name-Last: Pan Title: Likelihood Ratio Tests for a Large Directed Acyclic Graph Abstract: Inference of directional pairwise relations between interacting units in a directed acyclic graph (DAG), such as a regulatory gene network, is common in practice, imposing challenges because of a lack of inferential tools. For example, inferring a specific gene pathway of a regulatory gene network is biologically important. Yet, frequentist inference of directionality of connections remains largely unexplored for regulatory models. In this article, we propose constrained likelihood ratio tests for inference of the connectivity as well as directionality subject to nonconvex acyclicity constraints in a Gaussian directed graphical model. In particular, we derive the asymptotic distributions of the constrained likelihood ratios in a high-dimensional situation. For testing of connectivity, the asymptotic distribution is either chi-squared or normal, depending on whether the number of testable links in a DAG model is small. For testing of directionality, the asymptotic distribution is the minimum of d independent chi-squared variables with one degree of freedom or a generalized Gamma distribution, depending on whether d is small, where d is the number of breakpoints in a hypothesized pathway. Moreover, we develop a computational method to perform the proposed tests, which integrates an alternating direction method of multipliers and difference convex programming. Finally, the power analysis and simulations suggest that the tests achieve the desired objectives of inference. An analysis of an Alzheimer’s disease gene expression dataset illustrates the utility of the proposed method to infer a directed pathway in a gene network. Journal: Journal of the American Statistical Association Pages: 1304-1319 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1623042 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1623042 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1304-1319 Template-Type: ReDIF-Article 1.0 Author-Name: Beniamino Hadj-Amar Author-X-Name-First: Beniamino Author-X-Name-Last: Hadj-Amar Author-Name: Bärbel Finkenstädt Rand Author-X-Name-First: Bärbel Finkenstädt Author-X-Name-Last: Rand Author-Name: Mark Fiecas Author-X-Name-First: Mark Author-X-Name-Last: Fiecas Author-Name: Francis Lévi Author-X-Name-First: Francis Author-X-Name-Last: Lévi Author-Name: Robert Huckstepp Author-X-Name-First: Robert Author-X-Name-Last: Huckstepp Title: Bayesian Model Search for Nonstationary Periodic Time Series Abstract: We propose a novel Bayesian methodology for analyzing nonstationary time series that exhibit oscillatory behavior.
We approximate the time series using a piecewise oscillatory model with unknown periodicities, where our goal is to estimate the change-points while simultaneously identifying the potentially changing periodicities in the data. Our proposed methodology is based on a trans-dimensional Markov chain Monte Carlo algorithm that simultaneously updates the change-points and the periodicities relevant to any segment between them. We show that the proposed methodology successfully identifies time-changing oscillatory behavior in two applications which are relevant to e-Health and sleep research, namely the occurrence of ultradian oscillations in human skin temperature during the time of night rest, and the detection of instances of sleep apnea in plethysmographic respiratory traces. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1320-1335 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1623043 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1623043 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1320-1335 Template-Type: ReDIF-Article 1.0 Author-Name: Carina Gerstenberger Author-X-Name-First: Carina Author-X-Name-Last: Gerstenberger Author-Name: Daniel Vogel Author-X-Name-First: Daniel Author-X-Name-Last: Vogel Author-Name: Martin Wendler Author-X-Name-First: Martin Author-X-Name-Last: Wendler Title: Tests for Scale Changes Based on Pairwise Differences Abstract: In many applications it is important to know whether the amount of fluctuation in a series of observations changes over time. In this article, we investigate different tests for detecting changes in the scale of mean-stationary time series. The classical approach, based on the CUSUM test applied to the squared centered observations, is very vulnerable to outliers and impractical for heavy-tailed data, which leads us to contemplate test statistics based on alternative, less outlier-sensitive scale estimators. It turns out that the tests based on Gini’s mean difference (the average of all pairwise distances) and generalized Qn estimators (sample quantiles of all pairwise distances) are very suitable candidates. They improve upon the classical test not only under heavy tails or in the presence of outliers, but also under normality. We use recent results on the process convergence of U-statistics and U-quantiles for dependent sequences to derive the limiting distribution of the test statistics and propose estimators for the long-run variance. We show the consistency of the tests and demonstrate the applicability of the new change-point detection methods on two real-life data examples from hydrology and finance. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1336-1348 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1629938 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1629938 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1336-1348 Template-Type: ReDIF-Article 1.0 Author-Name: Meng Li Author-X-Name-First: Meng Author-X-Name-Last: Li Author-Name: David B. Dunson Author-X-Name-First: David B.
Author-X-Name-Last: Dunson Title: Comparing and Weighting Imperfect Models Using D-Probabilities Abstract: We propose a new approach for assigning weights to models using a divergence-based method (D-probabilities), relying on evaluating parametric models relative to a nonparametric Bayesian reference using Kullback–Leibler divergence. D-probabilities are useful in goodness-of-fit assessments, in comparing imperfect models, and in providing model weights to be used in model aggregation. D-probabilities avoid some of the disadvantages of Bayesian model probabilities, such as large sensitivity to prior choice, and tend to place higher weight on a greater diversity of models. In an application to linear model selection against a Gaussian process reference, we provide simple analytic forms for routine implementation and show that D-probabilities automatically penalize model complexity. Some asymptotic properties are described, and we provide interesting probabilistic interpretations of the proposed model weights. The framework is illustrated through simulation examples and an ozone data application. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1349-1360 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1611140 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1611140 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1349-1360 Template-Type: ReDIF-Article 1.0 Author-Name: Holger Dette Author-X-Name-First: Holger Author-X-Name-Last: Dette Author-Name: Josua Gösmann Author-X-Name-First: Josua Author-X-Name-Last: Gösmann Title: A Likelihood Ratio Approach to Sequential Change Point Detection for a General Class of Parameters Abstract: In this article, we propose a new approach for sequential monitoring of a general class of parameters of a d-dimensional time series, which can be estimated by approximately linear functionals of the empirical distribution function. We consider a closed-end method, which is motivated by the likelihood ratio test principle, and compare the new method with two alternative procedures. We also incorporate self-normalization such that estimation of the long-run variance is not necessary. We prove that for a large class of testing problems the new detection scheme has asymptotic level α and is consistent. The asymptotic theory is illustrated for the important cases of monitoring a change in the mean, variance, and correlation. By means of a simulation study it is demonstrated that the new test performs better than the currently available procedures for these problems. Finally, the methodology is illustrated by a small data example investigating index prices from the dot-com bubble. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1361-1377 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1630562 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1630562 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1361-1377 Template-Type: ReDIF-Article 1.0 Author-Name: Karthik Bharath Author-X-Name-First: Karthik Author-X-Name-Last: Bharath Author-Name: Sebastian Kurtek Author-X-Name-First: Sebastian Author-X-Name-Last: Kurtek Title: Distribution on Warp Maps for Alignment of Open and Closed Curves Abstract: Alignment of curve data is an integral part of their statistical analysis, and can be achieved using model- or optimization-based approaches. The parameter space is usually the set of monotone, continuous warp maps of a domain. The infinite-dimensional nature of the parameter space encourages sampling-based approaches, which require a distribution on the set of warp maps. Moreover, the distribution should also enable sampling in the presence of important landmark information on the curves which constrain the warp maps. For alignment of closed and open curves in R^d, d = 1, 2, 3, possibly with landmark information, we provide a constructive, point-process-based definition of a distribution on the set of warp maps of [0, 1] and the unit circle S^1, that is, (1) simple to sample from, and (2) possesses the desiderata for decomposition of the alignment problem with landmark constraints into multiple unconstrained ones. For warp maps on [0, 1], the distribution is related to the Dirichlet process. We demonstrate its utility by using it as a prior distribution on warp maps in a Bayesian model for alignment of two univariate curves, and as a proposal distribution in a stochastic algorithm that optimizes a suitable alignment functional for higher-dimensional curves. Several examples from simulated and real datasets are provided. Journal: Journal of the American Statistical Association Pages: 1378-1392 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1632066 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1632066 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1378-1392 Template-Type: ReDIF-Article 1.0 Author-Name: Tingyou Zhou Author-X-Name-First: Tingyou Author-X-Name-Last: Zhou Author-Name: Liping Zhu Author-X-Name-First: Liping Author-X-Name-Last: Zhu Author-Name: Chen Xu Author-X-Name-First: Chen Author-X-Name-Last: Xu Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Title: Model-Free Forward Screening Via Cumulative Divergence Abstract: Feature screening plays an important role in the analysis of ultrahigh dimensional data. Due to complicated model structure and high noise level, existing screening methods often suffer from model misspecification and the presence of outliers. To address these issues, we introduce a new metric named cumulative divergence (CD), and develop a CD-based forward screening procedure. This forward screening method is model-free and resistant to the presence of outliers in the response. It also incorporates the joint effects among covariates into the screening process. With a data-driven threshold, the new method can automatically determine the number of features that should be retained after screening. These merits make the CD-based screening very appealing in practice. Under certain regularity conditions, we show that the proposed method possesses the sure screening property. The performance of our proposal is illustrated through simulations and a real data example. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 1393-1405 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1632078 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1632078 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1393-1405 Template-Type: ReDIF-Article 1.0 Author-Name: Guan Yu Author-X-Name-First: Guan Author-X-Name-Last: Yu Author-Name: Quefeng Li Author-X-Name-First: Quefeng Author-X-Name-Last: Li Author-Name: Dinggang Shen Author-X-Name-First: Dinggang Author-X-Name-Last: Shen Author-Name: Yufeng Liu Author-X-Name-First: Yufeng Author-X-Name-Last: Liu Title: Optimal Sparse Linear Prediction for Block-missing Multi-modality Data Without Imputation Abstract: In modern scientific research, data are often collected from multiple modalities. Since different modalities could provide complementary information, statistical prediction methods using multimodality data could deliver better prediction performance than using single-modality data. However, one special challenge for using multimodality data is related to block-missing data. In practice, due to dropouts or the high cost of measurement, the observations of a certain modality can be missing completely for some subjects. In this paper, we propose a new direct sparse regression procedure using covariance from multimodality data (DISCOM). Our proposed DISCOM method includes two steps to find the optimal linear prediction of a continuous response variable using block-missing multimodality predictors. In the first step, rather than deleting or imputing missing data, we make use of all available information to estimate the covariance matrix of the predictors and the cross-covariance vector between the predictors and the response variable. The proposed new estimate of the covariance matrix is a linear combination of the identity matrix, the estimates of the intra-modality covariance matrix, and the cross-modality covariance matrix. Flexible estimates for both the sub-Gaussian and heavy-tailed cases are considered. In the second step, based on the estimated covariance matrix and the estimated cross-covariance vector, an extended Lasso-type estimator is used to deliver a sparse estimate of the coefficients in the optimal linear prediction. The number of samples that are effectively used by DISCOM is the minimum number of samples with available observations from two modalities, which can be much larger than the number of samples with complete observations from all modalities. The effectiveness of the proposed method is demonstrated by theoretical studies, simulated examples, and a real application from the Alzheimer’s Disease Neuroimaging Initiative. The comparison between DISCOM and some existing methods also indicates the advantages of our proposed method. Journal: Journal of the American Statistical Association Pages: 1406-1419 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1632079 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1632079 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1406-1419 Template-Type: ReDIF-Article 1.0 Author-Name: Eardi Lila Author-X-Name-First: Eardi Author-X-Name-Last: Lila Author-Name: John A. D. Aston Author-X-Name-First: John A. D.
Author-X-Name-Last: Aston Title: Statistical Analysis of Functions on Surfaces, With an Application to Medical Imaging Abstract: In functional data analysis, data are commonly assumed to be smooth functions on a fixed interval of the real line. In this work, we introduce a comprehensive framework for the analysis of functional data, whose domain is a two-dimensional manifold and the domain itself is subject to variability from sample to sample. We formulate a statistical model for such data, here called functions on surfaces, which enables a joint representation of the geometric and functional aspects, and propose an associated estimation framework. We assess the validity of the framework by performing a simulation study, and we finally apply it to the analysis of neuroimaging data of cortical thickness, acquired from the brains of different subjects, and thus lying on domains with different geometries. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1420-1434 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1635479 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1635479 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1420-1434 Template-Type: ReDIF-Article 1.0 Author-Name: Zhigang Yao Author-X-Name-First: Zhigang Author-X-Name-Last: Yao Author-Name: Zhenyue Zhang Author-X-Name-First: Zhenyue Author-X-Name-Last: Zhang Title: Principal Boundary on Riemannian Manifolds Abstract: We consider the classification problem and focus on nonlinear methods for classification on manifolds. For multivariate datasets lying on an embedded nonlinear Riemannian manifold within the higher-dimensional ambient space, we aim to acquire a classification boundary for the classes with labels, using the intrinsic metric on the manifolds. Motivated by finding an optimal boundary between the two classes, we invent a novel approach, the principal boundary. From the perspective of classification, the principal boundary is defined as an optimal curve that moves in between the principal flows traced out from two classes of data, and at any point on the boundary, it maximizes the margin between the two classes. We estimate the boundary, with its direction supervised by the two principal flows. We show that the principal boundary yields the usual decision boundary found by the support vector machine in the sense that locally, the two boundaries coincide. Some optimality and convergence properties of the random principal boundary and its population counterpart are also shown. We illustrate how to find, use, and interpret the principal boundary with an application to real data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1435-1448 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1610660 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1610660 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1435-1448 Template-Type: ReDIF-Article 1.0 Author-Name: Matias D. Cattaneo Author-X-Name-First: Matias D.
Author-X-Name-Last: Cattaneo Author-Name: Michael Jansson Author-X-Name-First: Michael Author-X-Name-Last: Jansson Author-Name: Xinwei Ma Author-X-Name-First: Xinwei Author-X-Name-Last: Ma Title: Simple Local Polynomial Density Estimators Abstract: This article introduces an intuitive and easy-to-implement nonparametric density estimator based on local polynomial techniques. The estimator is fully boundary adaptive and automatic, but does not require prebinning or any other transformation of the data. We study the main asymptotic properties of the estimator, and use these results to provide principled estimation, inference, and bandwidth selection methods. As a substantive application of our results, we develop a novel discontinuity-in-density testing procedure, addressing an important problem in regression discontinuity designs and other program evaluation settings. An illustrative empirical application is given. Two companion Stata and R software packages are provided. Journal: Journal of the American Statistical Association Pages: 1449-1455 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1635480 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1635480 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1449-1455 Template-Type: ReDIF-Article 1.0 Author-Name: Chaowen Zheng Author-X-Name-First: Chaowen Author-X-Name-Last: Zheng Author-Name: Yichao Wu Author-X-Name-First: Yichao Author-X-Name-Last: Wu Title: Nonparametric Estimation of Multivariate Mixtures Abstract: A multivariate mixture model is determined by three elements: the number of components, the mixing proportions, and the component distributions. Assuming that the number of components is given and that each mixture component has independent marginal distributions, we propose a nonparametric method to estimate the component distributions. The basic idea is to convert the estimation of component density functions to a problem of estimating the coordinates of the component density functions with respect to a good set of basis functions. Specifically, we construct a set of basis functions by using conditional density functions and try to recover the coordinates of component density functions with respect to this set of basis functions. Furthermore, we show that our estimator for the component density functions is consistent. Numerical studies are used to compare our algorithm with other existing nonparametric methods of estimating component distributions under the assumption of conditionally independent marginals. Journal: Journal of the American Statistical Association Pages: 1456-1471 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1635481 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1635481 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1456-1471 Template-Type: ReDIF-Article 1.0 Author-Name: Jack Kamm Author-X-Name-First: Jack Author-X-Name-Last: Kamm Author-Name: Jonathan Terhorst Author-X-Name-First: Jonathan Author-X-Name-Last: Terhorst Author-Name: Richard Durbin Author-X-Name-First: Richard Author-X-Name-Last: Durbin Author-Name: Yun S. Song Author-X-Name-First: Yun S.
Author-X-Name-Last: Song Title: Efficiently Inferring the Demographic History of Many Populations With Allele Count Data Abstract: The sample frequency spectrum (SFS), or histogram of allele counts, is an important summary statistic in evolutionary biology, and is often used to infer the history of population size changes, migrations, and other demographic events affecting a set of populations. The expected multipopulation SFS under a given demographic model can be efficiently computed when the populations in the model are related by a tree, scaling to hundreds of populations. Admixture, back-migration, and introgression are common natural processes that violate the assumption of a tree-like population history, however, and until now the expected SFS could be computed for only a handful of populations when the demographic history is not a tree. In this article, we present a new method for efficiently computing the expected SFS and linear functionals of it, for demographies described by general directed acyclic graphs. This method can scale to more populations than previously possible for complex demographic histories including admixture. We apply our method to an 8-population SFS to estimate the timing and strength of a proposed “basal Eurasian” admixture event in human history. We implement and release our method in a new open-source software package momi2. Journal: Journal of the American Statistical Association Pages: 1472-1487 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1635482 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1635482 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1472-1487 Template-Type: ReDIF-Article 1.0 Author-Name: Wei Ma Author-X-Name-First: Wei Author-X-Name-Last: Ma Author-Name: Yichen Qin Author-X-Name-First: Yichen Author-X-Name-Last: Qin Author-Name: Yang Li Author-X-Name-First: Yang Author-X-Name-Last: Li Author-Name: Feifang Hu Author-X-Name-First: Feifang Author-X-Name-Last: Hu Title: Statistical Inference for Covariate-Adaptive Randomization Procedures Abstract: Covariate-adaptive randomization (CAR) procedures are frequently used in comparative studies to increase the covariate balance across treatment groups. However, because randomization inevitably uses the covariate information when forming balanced treatment groups, the validity of classical statistical methods after such randomization is often unclear. In this article, we derive the theoretical properties of statistical methods based on general CAR under the linear model framework. More importantly, we explicitly unveil the relationship between covariate-adaptive and inference properties by deriving the asymptotic representations of the corresponding estimators. We apply the proposed general theory to various randomization procedures such as complete randomization, rerandomization, pairwise sequential randomization, and Atkinson’s D_A-biased coin design, and compare their performance analytically. Based on the theoretical results, we then propose a new approach to obtain valid and more powerful tests. These results open the door to understanding and analyzing experiments based on CAR. Simulation studies provide further evidence of the advantages of the proposed framework and the theoretical results. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 1488-1497 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1635483 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1635483 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1488-1497 Template-Type: ReDIF-Article 1.0 Author-Name: Federico Ricciardi Author-X-Name-First: Federico Author-X-Name-Last: Ricciardi Author-Name: Alessandra Mattei Author-X-Name-First: Alessandra Author-X-Name-Last: Mattei Author-Name: Fabrizia Mealli Author-X-Name-First: Fabrizia Author-X-Name-Last: Mealli Title: Bayesian Inference for Sequential Treatments Under Latent Sequential Ignorability Abstract: We focus on causal inference for longitudinal treatments, where units are assigned to treatments at multiple time points, aiming to assess the effect of different treatment sequences on an outcome observed at a final point. A common assumption in similar studies is sequential ignorability (SI): treatment assignment at each time point is assumed independent of future potential outcomes given past observed outcomes and covariates. SI is questionable when treatment participation depends on individual choices, and treatment assignment may depend on unobservable quantities associated with future outcomes. We rely on principal stratification to formulate a relaxed version of SI: latent sequential ignorability (LSI) assumes that treatment assignment is conditionally independent of future potential outcomes given past treatments, covariates, and principal stratum membership, a latent variable defined by the joint value of observed and missing intermediate outcomes. We evaluate SI and LSI, using theoretical arguments and simulation studies to investigate the performance of the two assumptions when one holds and inference is conducted under both. Simulations show that when SI does not hold, inference performed under SI leads to misleading conclusions. Conversely, LSI generally leads to correct posterior distributions, irrespective of which assumption holds. Journal: Journal of the American Statistical Association Pages: 1498-1517 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1623039 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1623039 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1498-1517 Template-Type: ReDIF-Article 1.0 Author-Name: Colin B. Fogarty Author-X-Name-First: Colin B. Author-X-Name-Last: Fogarty Title: Studentized Sensitivity Analysis for the Sample Average Treatment Effect in Paired Observational Studies Abstract: A fundamental limitation of causal inference in observational studies is that perceived evidence for an effect might instead be explained by factors not accounted for in the primary analysis. Methods for assessing the sensitivity of a study’s conclusions to unmeasured confounding have been established under the assumption that the treatment effect is constant across all individuals. In the potential presence of unmeasured confounding, it has been argued that certain patterns of effect heterogeneity may conspire with unobserved covariates to render the performed sensitivity analysis inadequate. We present a new method for conducting a sensitivity analysis for the sample average treatment effect in the presence of effect heterogeneity in paired observational studies.
Our recommended procedure, called the studentized sensitivity analysis, represents an extension of recent work on studentized permutation tests to the case of observational studies, where randomizations are no longer drawn uniformly. The method naturally extends conventional tests for the sample average treatment effect in paired experiments to the case of unknown, but bounded, probabilities of assignment to treatment. In so doing, we illustrate that concerns about certain sensitivity analyses operating under the presumption of constant effects are largely unwarranted. Journal: Journal of the American Statistical Association Pages: 1518-1530 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1632072 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1632072 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1518-1530 Template-Type: ReDIF-Article 1.0 Author-Name: Gabrielle Simoneau Author-X-Name-First: Gabrielle Author-X-Name-Last: Simoneau Author-Name: Erica E. M. Moodie Author-X-Name-First: Erica E. M. Author-X-Name-Last: Moodie Author-Name: Jagtar S. Nijjar Author-X-Name-First: Jagtar S. Author-X-Name-Last: Nijjar Author-Name: Robert W. Platt Author-X-Name-First: Robert W. Author-X-Name-Last: Platt Author-Name: the Scottish Early Rheumatoid Arthritis Inception Cohort Investigators Author-X-Name-First: Author-X-Name-Last: the Scottish Early Rheumatoid Arthritis Inception Cohort Investigators Title: Estimating Optimal Dynamic Treatment Regimes With Survival Outcomes Abstract: The statistical study of precision medicine is concerned with dynamic treatment regimes (DTRs) in which treatment decisions are tailored to patient-level information. Individuals are followed through multiple stages of clinical intervention, and the goal is to perform inferences on the sequence of personalized treatment decision rules to be applied in practice. Of interest is the identification of an optimal DTR, that is, the sequence of treatment decisions that yields the best expected outcome. Statistical methods for identifying optimal DTRs from observational data are theoretically complex and not easily implementable by researchers, especially when the outcome of interest is survival time. We propose a doubly robust, easy-to-implement method for estimating optimal DTRs with survival endpoints subject to right-censoring, which requires solving a series of weighted generalized estimating equations. We provide a proof of consistency that relies on the balancing property of the weights and derive a formula for the asymptotic variance of the resulting estimators. We illustrate our novel approach with an application to the treatment of rheumatoid arthritis using observational data from the Scottish Early Rheumatoid Arthritis Inception Cohort. Our method, called dynamic weighted survival modeling, has been implemented in the DTRreg R package. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1531-1539 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1629939 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1629939 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1531-1539 Template-Type: ReDIF-Article 1.0 Author-Name: Shu Yang Author-X-Name-First: Shu Author-X-Name-Last: Yang Author-Name: Peng Ding Author-X-Name-First: Peng Author-X-Name-Last: Ding Title: Combining Multiple Observational Data Sources to Estimate Causal Effects Abstract: The era of big data has witnessed an increasing availability of multiple data sources for statistical analyses. We consider estimation of causal effects combining big main data with unmeasured confounders and smaller validation data with supplementary information on these confounders. Under the unconfoundedness assumption with completely observed confounders, the smaller validation data allow for constructing consistent estimators for causal effects, but the big main data can only give error-prone estimators in general. However, by leveraging the information in the big main data in a principled way, we can improve the estimation efficiencies yet preserve the consistencies of the initial estimators based solely on the validation data. Our framework applies to asymptotically normal estimators, including the commonly used regression imputation, weighting, and matching estimators, and does not require a correct specification of the model relating the unmeasured confounders to the observed variables. We also propose appropriate bootstrap procedures, which make our method straightforward to implement using software routines for existing estimators. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1540-1554 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2019.1609973 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1609973 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1540-1554 Template-Type: ReDIF-Article 1.0 Author-Name: Genevera I. Allen Author-X-Name-First: Genevera I. Author-X-Name-Last: Allen Title: Handbook of Graphical Models Journal: Journal of the American Statistical Association Pages: 1555-1557 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2020.1801279 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1801279 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1555-1557 Template-Type: ReDIF-Article 1.0 Author-Name: Ling Leng Author-X-Name-First: Ling Author-X-Name-Last: Leng Title: Statistical Computing With R Journal: Journal of the American Statistical Association Pages: 1557-1558 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2020.1801280 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1801280 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1557-1558 Template-Type: ReDIF-Article 1.0 Author-Name: Ming Chen Author-X-Name-First: Ming Author-X-Name-Last: Chen Title: Time Series Clustering and Classification Journal: Journal of the American Statistical Association Pages: 1558-1558 Issue: 531 Volume: 115 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2020.1801281 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1801281 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1558-1558 Template-Type: ReDIF-Article 1.0 Author-Name: Kaushik Jana Author-X-Name-First: Kaushik Author-X-Name-Last: Jana Author-Name: Debasis Sengupta Author-X-Name-First: Debasis Author-X-Name-Last: Sengupta Author-Name: Subrata Kundu Author-X-Name-First: Subrata Author-X-Name-Last: Kundu Author-Name: Arindam Chakraborty Author-X-Name-First: Arindam Author-X-Name-Last: Chakraborty Author-Name: Purnima Shaw Author-X-Name-First: Purnima Author-X-Name-Last: Shaw Title: The Statistical Face of a Region Under Monsoon Rainfall in Eastern India Abstract: A region under rainfall is a contiguous spatial area receiving positive precipitation at a particular time. The probabilistic behavior of such a region is an issue of interest in meteorological studies. A region under rainfall can be viewed as a shape object of a special kind, where scale and rotational invariance are not necessarily desirable attributes of a mathematical representation. For modeling variation in objects of this type, we propose an approximation of the boundary that can be represented as a real-valued function, and arrive at further approximation through functional principal component analysis, after suitable adjustment for asymmetry and incompleteness in the data. The analysis of an open access satellite dataset on monsoon precipitation over the Eastern Indian subcontinent leads to an explanation of most of the variation in the shapes of the regions under rainfall through a handful of interpretable functions that can be further approximated parametrically. The most important aspect of shape is found to be the size, followed by contraction/elongation, mostly along two pairs of orthogonal axes. The different modes of variation are remarkably stable across calendar years and across different thresholds for the minimum size of the region. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1559-1573 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1681275 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1681275 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1559-1573 Template-Type: ReDIF-Article 1.0 Author-Name: Xiangnan Feng Author-X-Name-First: Xiangnan Author-X-Name-Last: Feng Author-Name: Tengfei Li Author-X-Name-First: Tengfei Author-X-Name-Last: Li Author-Name: Xinyuan Song Author-X-Name-First: Xinyuan Author-X-Name-Last: Song Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Title: Bayesian Scalar on Image Regression With Nonignorable Nonresponse Abstract: Medical imaging has become an increasingly important tool in screening, diagnosis, prognosis, and treatment of various diseases given its strengths in information visualization and quantitative assessment. The aim of this article is to develop a Bayesian scalar-on-image regression model to integrate high-dimensional imaging data and clinical data to predict cognitive, behavioral, or emotional outcomes, while allowing for nonignorable missing outcomes. Such a nonignorable nonresponse consideration is motivated by examining the association between baseline characteristics and cognitive abilities for 802 Alzheimer patients enrolled in the Alzheimer’s Disease Neuroimaging Initiative 1 (ADNI1), for which data are partially missing.
Ignoring such missing data may distort the accuracy of statistical inference and provoke misleading results. To address this issue, we propose an imaging exponential tilting model to delineate the missing data mechanism and incorporate an instrumental variable to facilitate model identifiability, followed by a Bayesian framework with Markov chain Monte Carlo algorithms to conduct statistical inference. This approach is validated in simulation studies where both the finite sample performance and asymptotic properties are evaluated and compared with the model with fully observed data and that with a misspecified ignorable missing mechanism. Our proposed methods are finally carried out on the ADNI1 dataset and turn out to capture both clinical risk factors and imaging regions that are consistent with the existing literature and exhibit clinical significance. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1574-1597 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1686391 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1686391 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1574-1597 Template-Type: ReDIF-Article 1.0 Author-Name: Bingduo Yang Author-X-Name-First: Bingduo Author-X-Name-Last: Yang Author-Name: Wei Long Author-X-Name-First: Wei Author-X-Name-Last: Long Author-Name: Liang Peng Author-X-Name-First: Liang Author-X-Name-Last: Peng Author-Name: Zongwu Cai Author-X-Name-First: Zongwu Author-X-Name-Last: Cai Title: Testing the Predictability of U.S. Housing Price Index Returns Based on an IVX-AR Model Abstract: We use ten common macroeconomic variables to test for the predictability of the quarterly growth rate of the house price index (HPI) in the United States during 1975:Q1–2018:Q2. We extend the instrumental variable based Wald statistic (IVX-KMS) proposed by Kostakis, Magdalinos, and Stamatogiannis to a new instrumental variable based Wald statistic (IVX-AR) which accounts for serial correlation and heteroscedasticity in the error terms of the linear predictive regression model. Simulation results show that the proposed IVX-AR exhibits excellent size control regardless of the degree of serial correlation in the error terms and the persistency in the predictive variables, while IVX-KMS displays severe size distortions. The empirical results indicate that the percentage of residential fixed investment in GDP is a fairly robust predictor of the growth rate of HPI. However, other macroeconomic variables’ strong predictive ability detected by IVX-KMS is likely to be driven by the highly correlated error terms in the predictive regressions and thus becomes insignificant when the proposed IVX-AR method is implemented. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1598-1619 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1686392 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1686392 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1598-1619 Template-Type: ReDIF-Article 1.0 Author-Name: Yang Ni Author-X-Name-First: Yang Author-X-Name-Last: Ni Author-Name: Peter Müller Author-X-Name-First: Peter Author-X-Name-Last: Müller Author-Name: Yuan Ji Author-X-Name-First: Yuan Author-X-Name-Last: Ji Title: Bayesian Double Feature Allocation for Phenotyping With Electronic Health Records Abstract: Electronic health records (EHR) provide opportunities for deeper understanding of human phenotypes—in our case, latent disease—based on statistical modeling. We propose a categorical matrix factorization method to infer latent diseases from EHR data. A latent disease is defined as an unknown biological aberration that causes a set of common symptoms for a group of patients. The proposed approach is based on a novel double feature allocation model which simultaneously allocates features to the rows and the columns of a categorical matrix. Using a Bayesian approach, available prior information on known diseases (e.g., hypertension and diabetes) greatly improves identifiability and interpretability of the latent diseases. We assess the proposed approach by simulation studies, including misspecified models and a comparison with sparse latent factor models. In the application to a Chinese EHR dataset, we identify 10 latent diseases, each of which is shared by groups of subjects with specific health traits related to lipid disorder, thrombocytopenia, polycythemia, anemia, bacterial and viral infections, allergy, and malnutrition. The identification of the latent diseases can help healthcare officials better monitor the subjects’ ongoing health conditions and look into potential risk factors and approaches for disease prevention. We cross-check the reported latent diseases with medical literature and find agreement between our discovery and reported findings elsewhere. We provide an R package “dfa” implementing our method and an R shiny web application reporting the findings. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1620-1634 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1686985 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1686985 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1620-1634 Template-Type: ReDIF-Article 1.0 Author-Name: Samrachana Adhikari Author-X-Name-First: Samrachana Author-X-Name-Last: Adhikari Author-Name: Sherri Rose Author-X-Name-First: Sherri Author-X-Name-Last: Rose Author-Name: Sharon-Lise Normand Author-X-Name-First: Sharon-Lise Author-X-Name-Last: Normand Title: Nonparametric Bayesian Instrumental Variable Analysis: Evaluating Heterogeneous Effects of Coronary Arterial Access Site Strategies Abstract: Percutaneous coronary interventions (PCIs) are nonsurgical procedures to open blocked blood vessels to the heart, frequently using a catheter to place a stent. The catheter can be inserted into the blood vessels using an artery in the groin or an artery in the wrist. Because clinical trials have indicated that access via the wrist may result in fewer post-procedure complications, shorter length of stay, and ultimately lower cost than groin access, adoption of access via the wrist has been encouraged.
However, patients treated in usual care are likely to differ from those participating in clinical trials, and there is reason to believe that the effectiveness of wrist access may differ between males and females. Moreover, the choice of artery access strategy is likely to be influenced by patient or physician unmeasured factors. To study the effectiveness of the two artery access site strategies on hospitalization charges, we use data from a state-mandated clinical registry including 7963 patients undergoing PCI. A hierarchical Bayesian likelihood-based instrumental variable analysis under a latent index modeling framework is introduced to jointly model outcomes and treatment status. Our approach accounts for unobserved heterogeneity via a latent factor structure, and permits nonparametric error distributions with Dirichlet process mixture models. Our results demonstrate that artery access in the wrist reduces hospitalization charges compared to access in the groin, with a higher mean reduction for male patients. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1635-1644 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1688663 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1688663 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1635-1644 Template-Type: ReDIF-Article 1.0 Author-Name: Changgee Chang Author-X-Name-First: Changgee Author-X-Name-Last: Chang Author-Name: Jeong Hoon Jang Author-X-Name-First: Jeong Hoon Author-X-Name-Last: Jang Author-Name: Amita Manatunga Author-X-Name-First: Amita Author-X-Name-Last: Manatunga Author-Name: Andrew T. Taylor Author-X-Name-First: Andrew T. Author-X-Name-Last: Taylor Author-Name: Qi Long Author-X-Name-First: Qi Author-X-Name-Last: Long Title: A Bayesian Latent Class Model to Predict Kidney Obstruction in the Absence of Gold Standard Abstract: Kidney obstruction, if not treated in a timely manner, can lead to irreversible loss of renal function. A widely used technology for evaluations of kidneys with suspected obstruction is diuresis renography. However, it is generally very challenging for radiologists who typically interpret renography data in practice to build a high level of competency due to the low volume of renography studies and insufficient training. Another challenge is that there is currently no gold standard for detection of kidney obstruction. Seeking to develop a computer-aided diagnostic (CAD) tool that can assist practicing radiologists to reduce errors in the interpretation of kidney obstruction, a recent study collected data from diuresis renography, interpretations on the renography data from highly experienced nuclear medicine experts as well as clinical data. To achieve the objective, we develop a statistical model that can be used as a CAD tool for assisting radiologists in kidney interpretation. We use a Bayesian latent class modeling approach for predicting kidney obstruction through the integrative analysis of time-series renogram data, expert ratings, and clinical variables. A nonparametric Bayesian latent factor regression approach is adopted for modeling renogram curves in which the coefficients of the basis functions are parameterized via the factor loadings dependent on the latent disease status and the extended latent factors that can also adjust for clinical variables.
A hierarchical probit model is used for expert ratings, allowing for training with rating data from multiple experts while predicting with at most one expert, which makes the proposed model operable in practice. An efficient MCMC algorithm is developed to train the model and predict kidney obstruction with associated uncertainty. We demonstrate the superiority of the proposed method over several existing methods through extensive simulations. Analysis of the renal study also lends support to the usefulness of our model as a CAD tool to assist less experienced radiologists in the field. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1645-1663 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1689983 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1689983 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1645-1663 Template-Type: ReDIF-Article 1.0 Author-Name: Chih-Li Sung Author-X-Name-First: Chih-Li Author-X-Name-Last: Sung Author-Name: Ying Hung Author-X-Name-First: Ying Author-X-Name-Last: Hung Author-Name: William Rittase Author-X-Name-First: William Author-X-Name-Last: Rittase Author-Name: Cheng Zhu Author-X-Name-First: Cheng Author-X-Name-Last: Zhu Author-Name: C. F. J. Wu Author-X-Name-First: C. F. J. Author-X-Name-Last: Wu Title: Calibration for Computer Experiments With Binary Responses and Application to Cell Adhesion Study Abstract: Calibration refers to the estimation of unknown parameters that are present in computer experiments but not available in physical experiments. An accurate estimation of these parameters is important because it provides a scientific understanding of the underlying system which is not available in physical experiments. Most of the work in the literature is limited to the analysis of continuous responses. Motivated by a study of cell adhesion experiments, we propose a new calibration framework for binary responses. Its application to the T cell adhesion data provides insight into the unknown values of the kinetic parameters which are difficult to determine by physical experiments due to the limitation of the existing experimental techniques. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1664-1674 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1699419 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1699419 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1664-1674 Template-Type: ReDIF-Article 1.0 Author-Name: Samuel D. Pimentel Author-X-Name-First: Samuel D. Author-X-Name-Last: Pimentel Author-Name: Rachel R. Kelz Author-X-Name-First: Rachel R. Author-X-Name-Last: Kelz Title: Optimal Tradeoffs in Matched Designs Comparing US-Trained and Internationally Trained Surgeons Abstract: Does receiving a medical education outside the United States impact a surgeon’s performance? We study this question by matching operations performed by internationally trained surgeons to those performed by US-trained surgeons in a reanalysis of a large health outcomes study.
An effective matched design must achieve several goals, including balancing covariate distributions marginally, ensuring units within individual pairs have similar values on key covariates, and using a sufficiently large sample from the raw data. Yet in our study, optimizing some of these goals forces less desirable results on others. We address such tradeoffs from a multi-objective optimization perspective by creating matched designs that are Pareto optimal with respect to two goals. We provide general tools for generating representative subsets of Pareto optimal solution sets and articulate how they can be used to improve decision-making in observational study design. In the motivating surgical outcomes study, formulating a multi-objective version of the problem helps us balance an important variable without sacrificing two other design goals, average closeness of matched pairs on a multivariate distance and size of the final matched sample. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1675-1688 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2020.1720693 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1720693 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1675-1688 Template-Type: ReDIF-Article 1.0 Author-Name: Emma G. Thomas Author-X-Name-First: Emma G. Author-X-Name-Last: Thomas Author-Name: Lorenzo Trippa Author-X-Name-First: Lorenzo Author-X-Name-Last: Trippa Author-Name: Giovanni Parmigiani Author-X-Name-First: Giovanni Author-X-Name-Last: Parmigiani Author-Name: Francesca Dominici Author-X-Name-First: Francesca Author-X-Name-Last: Dominici Title: Estimating the Effects of Fine Particulate Matter on 432 Cardiovascular Diseases Using Multi-Outcome Regression With Tree-Structured Shrinkage Abstract: The positive relationship between airborne fine particulate matter (PM2.5) and cardiovascular disease (CVD) is well established. Little is known about effect size heterogeneity across distinct CVD outcomes. We conducted a multi-outcome case-crossover study of Medicare beneficiaries aged >65 years residing in the mainland USA from 2000 through 2012. The exposure was the two-day average PM2.5 in each individual’s residential zipcode. The outcomes were hospitalization for 432 distinct CVDs defined by the International Classification of Diseases, Revision 9. Our dataset included almost 24 million CVD hospitalizations. We analyzed the data using multi-outcome regression with tree-structured shrinkage (MOReTreeS), a novel method that enables: (1) borrowing of strength across outcomes; (2) data-driven discovery of outcome groups that are similarly affected by the exposure; (3) estimation of a single effect for each group. MOReTreeS grouped 420 outcomes together; for this group, the odds ratio [OR] for hospitalization associated with a 10 μg/m^3 increase in PM2.5 was 1.011 (95% credible interval [CI] = 1.011–1.012). The model identified congestive heart failure as having the strongest positive association with PM2.5 (OR = 1.019; 95%CI = 1.017–1.022). Some outcomes exhibited negative associations with PM2.5, including aortic dissection, subarachnoid and intracerebral hemorrhage, abdominal aneurysm, and essential hypertension; further research is needed to understand these counterintuitive findings.
Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1689-1699 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2020.1722134 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1722134 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1689-1699 Template-Type: ReDIF-Article 1.0 Author-Name: Lan Wang Author-X-Name-First: Lan Author-X-Name-Last: Wang Author-Name: Bo Peng Author-X-Name-First: Bo Author-X-Name-Last: Peng Author-Name: Jelena Bradic Author-X-Name-First: Jelena Author-X-Name-Last: Bradic Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Author-Name: Yunan Wu Author-X-Name-First: Yunan Author-X-Name-Last: Wu Title: A Tuning-free Robust and Efficient Approach to High-dimensional Regression Abstract: We introduce a novel approach for high-dimensional regression with theoretical guarantees. The new procedure overcomes the challenge of tuning parameter selection of Lasso and possesses several appealing properties. It uses an easily simulated tuning parameter that automatically adapts to both the unknown random error distribution and the correlation structure of the design matrix. It is robust with substantial efficiency gain for heavy-tailed random errors while maintaining high efficiency for normal random errors. Compared with other robust regression procedures, it also enjoys the property of being equivariant when the response variable undergoes a scale transformation. Computationally, it can be efficiently solved via linear programming. Theoretically, under weak conditions on the random error distribution, we establish a finite-sample error bound with a near-oracle rate for the new estimator with the simulated tuning parameter. Our results make useful contributions to bridging the gap between the practice and theory of Lasso and its variants. We also prove that further improvement in efficiency can be achieved by a second-stage enhancement with some light tuning. Our simulation results demonstrate that the proposed methods often outperform cross-validated Lasso in various settings. Journal: Journal of the American Statistical Association Pages: 1700-1714 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2020.1840989 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1840989 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1700-1714 Template-Type: ReDIF-Article 1.0 Author-Name: Po-Ling Loh Author-X-Name-First: Po-Ling Author-X-Name-Last: Loh Title: Comment on “A Tuning-Free Robust and Efficient Approach to High-Dimensional Regression” Journal: Journal of the American Statistical Association Pages: 1715-1716 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2020.1837141 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1837141 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1715-1716 Template-Type: ReDIF-Article 1.0 Author-Name: Xiudi Li Author-X-Name-First: Xiudi Author-X-Name-Last: Li Author-Name: Ali Shojaie Author-X-Name-First: Ali Author-X-Name-Last: Shojaie Title: Discussion of “A Tuning-Free Robust and Efficient Approach to High-Dimensional Regression” Journal: Journal of the American Statistical Association Pages: 1717-1719 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2020.1837139 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1837139 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1717-1719 Template-Type: ReDIF-Article 1.0 Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Cong Ma Author-X-Name-First: Cong Author-X-Name-Last: Ma Author-Name: Kaizheng Wang Author-X-Name-First: Kaizheng Author-X-Name-Last: Wang Title: Comment on “A Tuning-Free Robust and Efficient Approach to High-Dimensional Regression” Journal: Journal of the American Statistical Association Pages: 1720-1725 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2020.1837138 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1837138 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1720-1725 Template-Type: ReDIF-Article 1.0 Author-Name: Lan Wang Author-X-Name-First: Lan Author-X-Name-Last: Wang Author-Name: Bo Peng Author-X-Name-First: Bo Author-X-Name-Last: Peng Author-Name: Jelena Bradic Author-X-Name-First: Jelena Author-X-Name-Last: Bradic Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Author-Name: Yunan Wu Author-X-Name-First: Yunan Author-X-Name-Last: Wu Title: Rejoinder to “A Tuning-Free Robust and Efficient Approach to High-Dimensional Regression” Journal: Journal of the American Statistical Association Pages: 1726-1729 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2020.1843865 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1843865 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1726-1729 Template-Type: ReDIF-Article 1.0 Author-Name: Alexander M. Franks Author-X-Name-First: Alexander M. Author-X-Name-Last: Franks Author-Name: Alexander D’Amour Author-X-Name-First: Alexander Author-X-Name-Last: D’Amour Author-Name: Avi Feller Author-X-Name-First: Avi Author-X-Name-Last: Feller Title: Flexible Sensitivity Analysis for Observational Studies Without Observable Implications Abstract: A fundamental challenge in observational causal inference is that assumptions about unconfoundedness are not testable from data. Assessing sensitivity to such assumptions is therefore important in practice. Unfortunately, some existing sensitivity analysis approaches inadvertently impose restrictions that are at odds with modern causal inference methods, which emphasize flexible models for observed data. To address this issue, we propose a framework that allows (1) flexible models for the observed data and (2) clean separation of the identified and unidentified parts of the sensitivity model. Our framework extends an approach from the missing data literature, known as Tukey’s factorization, to the causal inference setting.
Under this factorization, we can represent the distributions of unobserved potential outcomes in terms of unidentified selection functions that posit a relationship between treatment assignment and unobserved potential outcomes. The sensitivity parameters in this framework are easily interpreted, and we provide heuristics for calibrating these parameters against observable quantities. We demonstrate the flexibility of this approach in two examples, where we estimate both average treatment effects and quantile treatment effects using Bayesian nonparametric models for the observed data. Journal: Journal of the American Statistical Association Pages: 1730-1746 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1604369 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604369 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1730-1746 Template-Type: ReDIF-Article 1.0 Author-Name: Leying Guan Author-X-Name-First: Leying Author-X-Name-Last: Guan Author-Name: Xi Chen Author-X-Name-First: Xi Author-X-Name-Last: Chen Author-Name: Wing Hung Wong Author-X-Name-First: Wing Hung Author-X-Name-Last: Wong Title: Detecting Strong Signals in Gene Perturbation Experiments: An Adaptive Approach With Power Guarantee and FDR Control Abstract: The perturbation of a transcription factor should affect the expression levels of its direct targets. However, not all genes showing changes in expression are direct targets. To increase the chance of detecting direct targets, we propose a modified two-group model where the null group corresponds to genes that are not direct targets, but can have small nonzero effects. We model the behavior of genes from the null set by a Gaussian distribution with unknown variance τ^2. To estimate τ^2, we focus on a simple estimation approach, the iterated empirical Bayes estimation. We conduct a detailed analysis of the properties of the iterated EB estimate and provide a theoretical guarantee of its good performance under mild conditions. We provide simulations comparing the new modeling approach with existing methods, and the new approach shows more stable and better performance under different situations. We also apply it to a real dataset from gene knock-down experiments and obtain better results compared with the original two-group model testing for nonzero effects. Journal: Journal of the American Statistical Association Pages: 1747-1755 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1635484 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1635484 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1747-1755 Template-Type: ReDIF-Article 1.0 Author-Name: Yunxiao Chen Author-X-Name-First: Yunxiao Author-X-Name-Last: Chen Author-Name: Xiaoou Li Author-X-Name-First: Xiaoou Author-X-Name-Last: Li Author-Name: Siliang Zhang Author-X-Name-First: Siliang Author-X-Name-Last: Zhang Title: Structured Latent Factor Analysis for Large-scale Data: Identifiability, Estimability, and Their Implications Abstract: Latent factor models are widely used to measure unobserved latent traits in social and behavioral sciences, including psychology, education, and marketing. When used in a confirmatory manner, design information is incorporated as zero constraints on corresponding parameters, yielding structured (confirmatory) latent factor models.
In this article, we study how such design information affects the identifiability and the estimation of a structured latent factor model. Insights are gained through both asymptotic and nonasymptotic analyses. Our asymptotic results are established under a regime where both the number of manifest variables and the sample size diverge, motivated by applications to large-scale data. Under this regime, we define the structural identifiability of the latent factors and establish necessary and sufficient conditions that ensure structural identifiability. In addition, we propose an estimator which is shown to be consistent and rate optimal when structural identifiability holds. Finally, a nonasymptotic error bound is derived for this estimator, through which the effect of design information is further quantified. Our results shed light on the design of large-scale measurement in education and psychology and have important implications for measurement validity and reliability. Journal: Journal of the American Statistical Association Pages: 1756-1770 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1635485 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1635485 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1756-1770 Template-Type: ReDIF-Article 1.0 Author-Name: Jianwei Hu Author-X-Name-First: Jianwei Author-X-Name-Last: Hu Author-Name: Hong Qin Author-X-Name-First: Hong Author-X-Name-Last: Qin Author-Name: Ting Yan Author-X-Name-First: Ting Author-X-Name-Last: Yan Author-Name: Yunpeng Zhao Author-X-Name-First: Yunpeng Author-X-Name-Last: Zhao Title: Corrected Bayesian Information Criterion for Stochastic Block Models Abstract: Estimating the number of communities is one of the fundamental problems in community detection. We re-examine the Bayesian paradigm for stochastic block models (SBMs) and propose a “corrected Bayesian information criterion” (CBIC) to determine the number of communities, and show that the proposed criterion is consistent under mild conditions as the size of the network and the number of communities go to infinity. The CBIC outperforms the criteria used in Wang and Bickel and in Saldana, Yu, and Feng, which tend to underestimate and overestimate the number of communities, respectively. The results are further extended to degree-corrected SBMs. Numerical studies demonstrate our theoretical results. Journal: Journal of the American Statistical Association Pages: 1771-1783 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1637744 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1637744 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1771-1783 Template-Type: ReDIF-Article 1.0 Author-Name: Minsuk Shin Author-X-Name-First: Minsuk Author-X-Name-Last: Shin Author-Name: Anirban Bhattacharya Author-X-Name-First: Anirban Author-X-Name-Last: Bhattacharya Author-Name: Valen E. Johnson Author-X-Name-First: Valen E. Author-X-Name-Last: Johnson Title: Functional Horseshoe Priors for Subspace Shrinkage Abstract: We introduce a new shrinkage prior on function spaces, called the functional horseshoe (fHS) prior, that encourages shrinkage toward parametric classes of functions. Unlike other shrinkage priors for parametric models, the fHS shrinkage acts on the shape of the function rather than inducing sparsity on model parameters.
We study the efficacy of the proposed approach by showing an adaptive posterior concentration property on the function. We also demonstrate consistency of the model selection procedure that thresholds the shrinkage parameter of the fHS prior. We apply the fHS prior to nonparametric additive models and compare its performance with procedures based on the standard horseshoe prior and several penalized likelihood approaches. We find that the new procedure achieves smaller estimation error and more accurate model selection than other procedures in several simulated and real examples. Supplementary materials for this article, which contain additional simulated and real data examples, MCMC diagnostics, and proofs of the theoretical results, are available online. Journal: Journal of the American Statistical Association Pages: 1784-1797 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1654875 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1654875 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1784-1797 Template-Type: ReDIF-Article 1.0 Author-Name: Xiao Nie Author-X-Name-First: Xiao Author-X-Name-Last: Nie Author-Name: Peter Chien Author-X-Name-First: Peter Author-X-Name-Last: Chien Author-Name: Dane Morgan Author-X-Name-First: Dane Author-X-Name-Last: Morgan Author-Name: Amy Kaczmarowski Author-X-Name-First: Amy Author-X-Name-Last: Kaczmarowski Title: A Statistical Method for Emulation of Computer Models With Invariance-Preserving Properties, With Application to Structural Energy Prediction Abstract: Statistical design and analysis of computer experiments is a growing area in statistics. Computer models with structural invariance properties now appear frequently in materials science, physics, biology, and other fields. These properties are consequences of dependency on structural geometry, and cannot be accommodated by standard statistical emulation methods. In this article, we propose a statistical framework for building emulators to preserve invariance. The framework uses a weighted complete graph to represent the geometry and introduces a new class of functions, called the relabeling symmetric functions, associated with the graph. We establish a characterization theorem of the relabeling symmetric functions and propose a nonparametric kernel method for estimating such functions. The effectiveness of the proposed method is illustrated by examples from materials science. Supplemental material for this article can be found online. Journal: Journal of the American Statistical Association Pages: 1798-1811 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1654876 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1654876 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1798-1811 Template-Type: ReDIF-Article 1.0 Author-Name: A. S. Hedayat Author-X-Name-First: A. S. Author-X-Name-Last: Hedayat Author-Name: Heng Xu Author-X-Name-First: Heng Author-X-Name-Last: Xu Author-Name: Wei Zheng Author-X-Name-First: Wei Author-X-Name-Last: Zheng Title: Optimal Designs for the Two-Dimensional Interference Model Abstract: Recently, there have been some major advances in the theory of optimal designs for interference models when the block is arranged in a one-dimensional layout.
Relatively speaking, the study of the two-dimensional interference model is quite limited, partly due to technical difficulties. This article tries to fill this gap. Specifically, we set the tone by characterizing all possible universally optimal designs simultaneously through a single linear equation system (LES) with respect to the proportions of block arrays. However, such an LES is not readily solvable due to the extremely large number of block arrays. This computational issue can be resolved by identifying a small subset of block arrays with the theoretical guarantee that any optimal design is supported by this subset. The two-dimensional layout of the block makes this task technically challenging; we theoretically derive such a subset for any size of the treatment array and any number of treatments under comparison. This facilitates the development of an algorithm for deriving either approximate or exact designs. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1812-1821 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1654877 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1654877 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1812-1821 Template-Type: ReDIF-Article 1.0 Author-Name: Yingying Fan Author-X-Name-First: Yingying Author-X-Name-Last: Fan Author-Name: Jinchi Lv Author-X-Name-First: Jinchi Author-X-Name-Last: Lv Author-Name: Mahrad Sharifvaghefi Author-X-Name-First: Mahrad Author-X-Name-Last: Sharifvaghefi Author-Name: Yoshimasa Uematsu Author-X-Name-First: Yoshimasa Author-X-Name-Last: Uematsu Title: IPAD: Stable Interpretable Forecasting with Knockoffs Inference Abstract: Interpretability and stability are two important features that are desired in many contemporary big data applications arising in statistics, economics, and finance. While the former is enjoyed to some extent by many existing forecasting approaches, the latter, in the sense of controlling the fraction of wrongly discovered features, which can greatly enhance interpretability, is still largely underdeveloped. To this end, in this article, we exploit the general framework of model-X knockoffs introduced recently in Candès, Fan, Janson and Lv [(2018), “Panning for Gold: ‘model X’ Knockoffs for High Dimensional Controlled Variable Selection,” Journal of the Royal Statistical Society, Series B, 80, 551–577], which is unconventional for reproducible large-scale inference in that it is completely free of the use of p-values for significance testing, and suggest a new method of intertwined probabilistic factors decoupling (IPAD) for stable interpretable forecasting with knockoffs inference in high-dimensional models. The recipe of the method is to construct the knockoff variables by assuming a latent factor model, exploited widely in economics and finance, for the association structure of the covariates. Our method and work are distinct from the existing literature in that we estimate the covariate distribution from data instead of assuming that it is known when constructing the knockoff variables, our procedure does not require any sample splitting, we provide theoretical justifications for the asymptotic false discovery rate control, and the theory for the power analysis is also established.
Several simulation examples and the real data analysis further demonstrate that the newly suggested method has appealing finite-sample performance with desired interpretability and stability compared to some popularly used forecasting methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1822-1834 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1654878 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1654878 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1822-1834 Template-Type: ReDIF-Article 1.0 Author-Name: Jelena Bradic Author-X-Name-First: Jelena Author-X-Name-Last: Bradic Author-Name: Gerda Claeskens Author-X-Name-First: Gerda Author-X-Name-Last: Claeskens Author-Name: Thomas Gueuning Author-X-Name-First: Thomas Author-X-Name-Last: Gueuning Title: Fixed Effects Testing in High-Dimensional Linear Mixed Models Abstract: Many scientific and engineering challenges—ranging from pharmacokinetic drug dosage allocation and personalized medicine to marketing mix (4Ps) recommendations—require an understanding of the unobserved heterogeneity to develop the best decision-making processes. In this article, we develop a hypothesis test and the corresponding p-value for testing for the significance of the homogeneous structure in linear mixed models. A robust matching moment construction is used for creating a test that adapts to the size of the model sparsity. When unobserved heterogeneity at a cluster level is constant, we show that our test is both consistent and unbiased even when the dimension of the model is extremely high. Our theoretical results rely on a new family of adaptive sparse estimators of the fixed effects that do not require consistent estimation of the random effects. Moreover, our inference results do not require consistent model selection. We showcase that moment matching can be extended to nonlinear mixed effects models and to generalized linear mixed effects models. In numerical and real data experiments, we find that the developed method is extremely accurate, that it adapts to the size of the underlying model, and is decidedly powerful in the presence of irrelevant covariates. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1835-1850 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1660172 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1660172 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1835-1850 Template-Type: ReDIF-Article 1.0 Author-Name: Xinwei Ma Author-X-Name-First: Xinwei Author-X-Name-Last: Ma Author-Name: Jingshen Wang Author-X-Name-First: Jingshen Author-X-Name-Last: Wang Title: Robust Inference Using Inverse Probability Weighting Abstract: Inverse probability weighting (IPW) is widely used in empirical work in economics and other disciplines. As Gaussian approximations perform poorly in the presence of “small denominators,” trimming is routinely employed as a regularization strategy. However, ad hoc trimming of the observations renders usual inference procedures invalid for the target estimand, even in large samples.
In this article, we first show that the IPW estimator can have different (Gaussian or non-Gaussian) asymptotic distributions, depending on how “close to zero” the probability weights are and on how large the trimming threshold is. As a remedy, we propose an inference procedure that is robust not only to small probability weights entering the IPW estimator but also to a wide range of trimming threshold choices, by adapting to these different asymptotic distributions. This robustness is achieved by employing resampling techniques and by correcting a non-negligible trimming bias. We also propose an easy-to-implement method for choosing the trimming threshold by minimizing an empirical analogue of the asymptotic mean squared error. In addition, we show that our inference procedure remains valid with the use of a data-driven trimming threshold. We illustrate our method by revisiting a dataset from the National Supported Work program. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1851-1860 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1660173 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1660173 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1851-1860 Template-Type: ReDIF-Article 1.0 Author-Name: Yaniv Romano Author-X-Name-First: Yaniv Author-X-Name-Last: Romano Author-Name: Matteo Sesia Author-X-Name-First: Matteo Author-X-Name-Last: Sesia Author-Name: Emmanuel Candès Author-X-Name-First: Emmanuel Author-X-Name-Last: Candès Title: Deep Knockoffs Abstract: This article introduces a machine for sampling approximate model-X knockoffs for arbitrary and unspecified data distributions using deep generative models. The main idea is to iteratively refine a knockoff sampling mechanism until a criterion measuring the validity of the produced knockoffs is optimized; this criterion is inspired by the popular maximum mean discrepancy in machine learning and can be thought of as measuring the distance to pairwise exchangeability between original and knockoff features. By building upon the existing model-X framework, we thus obtain a flexible and model-free statistical tool to perform controlled variable selection. Extensive numerical experiments and quantitative tests confirm the generality, effectiveness, and power of our deep knockoff machines. Finally, we apply this new method to a real study of mutations linked to changes in drug resistance in the human immunodeficiency virus. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1861-1872 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1660174 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1660174 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1861-1872 Template-Type: ReDIF-Article 1.0 Author-Name: Eduardo García-Portugués Author-X-Name-First: Eduardo Author-X-Name-Last: García-Portugués Author-Name: Davy Paindaveine Author-X-Name-First: Davy Author-X-Name-Last: Paindaveine Author-Name: Thomas Verdebout Author-X-Name-First: Thomas Author-X-Name-Last: Verdebout Title: On Optimal Tests for Rotational Symmetry Against New Classes of Hyperspherical Distributions Abstract: Motivated by the central role played by rotationally symmetric distributions in directional statistics, we consider the problem of testing rotational symmetry on the hypersphere. We adopt a semiparametric approach and tackle problems where the location of the symmetry axis is either specified or unspecified. For each problem, we define two tests and study their asymptotic properties under very mild conditions. We introduce two new classes of directional distributions that extend the rotationally symmetric class and are of independent interest. We prove that each test is locally asymptotically maximin, in the Le Cam sense, for one kind of alternative given by the new classes of distributions, for both specified and unspecified symmetry axes. The tests, aimed at detecting location- and scatter-like alternatives, are combined into convenient hybrid tests that are consistent against both alternatives. We perform Monte Carlo experiments that illustrate the finite-sample performances of the proposed tests and their agreement with the asymptotic results. Finally, the practical relevance of our tests is illustrated on a real data application from astronomy. The R package rotasym implements the proposed tests and allows practitioners to reproduce the data application. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1873-1887 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1665527 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1665527 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1873-1887 Template-Type: ReDIF-Article 1.0 Author-Name: S. R. Johnson Author-X-Name-First: S. R. Author-X-Name-Last: Johnson Author-Name: D. A. Henderson Author-X-Name-First: D. A. Author-X-Name-Last: Henderson Author-Name: R. J. Boys Author-X-Name-First: R. J. Author-X-Name-Last: Boys Title: Revealing Subgroup Structure in Ranked Data Using a Bayesian WAND Abstract: Ranked data arise in many areas of application ranging from the ranking of up-regulated genes for cancer to the ranking of academic statistics journals. Complications can arise when rankers do not report a full ranking of all entities; for example, they might only report their top-M ranked entities after seeing some or all entities. It can also be useful to know whether rankers are equally informative, and whether some entities are effectively judged to be exchangeable. Revealing subgroup structure in the data may also be helpful in understanding the distribution of ranker views. In this paper, we propose a flexible Bayesian nonparametric model for identifying heterogeneous structure and ranker reliability in ranked data. The model is a weighted adapted nested Dirichlet (WAND) process mixture of Plackett–Luce models and inference proceeds through a simple and efficient Gibbs sampling scheme for posterior sampling.
The richness of information in the posterior distribution allows us to infer many details of the structure both between ranker groups and between entity groups (within-ranker groups). Our modeling framework also facilitates a flexible representation of the posterior predictive distribution. This flexibility is important as we propose to use the posterior predictive distribution as the basis for addressing the rank aggregation problem, and also for identifying lack of model fit. The methodology is illustrated using several simulation studies and real data examples. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1888-1901 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1665528 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1665528 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1888-1901 Template-Type: ReDIF-Article 1.0 Author-Name: P. Hall Author-X-Name-First: P. Author-X-Name-Last: Hall Author-Name: I.M. Johnstone Author-X-Name-First: I.M. Author-X-Name-Last: Johnstone Author-Name: J.T. Ormerod Author-X-Name-First: J.T. Author-X-Name-Last: Ormerod Author-Name: M.P. Wand Author-X-Name-First: M.P. Author-X-Name-Last: Wand Author-Name: J.C.F. Yu Author-X-Name-First: J.C.F. Author-X-Name-Last: Yu Title: Fast and Accurate Binary Response Mixed Model Analysis via Expectation Propagation Abstract: Expectation propagation is a general prescription for approximation of integrals in statistical inference problems. Its literature is mainly concerned with Bayesian inference scenarios. However, expectation propagation can also be used to approximate integrals arising in frequentist statistical inference. We focus on likelihood-based inference for binary response mixed models and show that fast and accurate quadrature-free inference can be realized for the probit link case with multivariate random effects and higher levels of nesting. The approach is supported by asymptotic calculations in which expectation propagation is seen to provide consistent estimation of the exact likelihood surface. Numerical studies reveal the availability of fast, highly accurate and scalable methodology for binary mixed model analysis. Journal: Journal of the American Statistical Association Pages: 1902-1916 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1665529 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1665529 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1902-1916 Template-Type: ReDIF-Article 1.0 Author-Name: David Benkeser Author-X-Name-First: David Author-X-Name-Last: Benkeser Author-Name: Maya Petersen Author-X-Name-First: Maya Author-X-Name-Last: Petersen Author-Name: Mark J. van der Laan Author-X-Name-First: Mark J. Author-X-Name-Last: van der Laan Title: Improved Small-Sample Estimation of Nonlinear Cross-Validated Prediction Metrics Abstract: When predicting an outcome is the scientific goal, one must decide on a metric by which to evaluate the quality of predictions. We consider the problem of measuring the performance of a prediction algorithm with the same data that were used to train the algorithm. Typical approaches involve bootstrapping or cross-validation. However, we demonstrate that bootstrap-based approaches often fail and standard cross-validation estimators may perform poorly. 
We provide a general study of cross-validation-based estimators that highlights the source of this poor performance, and propose an alternative framework for estimation using techniques from the efficiency theory literature. We provide a theorem establishing the weak convergence of our estimators. The general theorem is applied in detail to two specific examples, and we discuss possible extensions to other parameters of interest. For the two explicit examples that we consider, our estimators demonstrate remarkable finite-sample improvements over standard approaches. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1917-1932 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1668794 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1668794 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1917-1932 Template-Type: ReDIF-Article 1.0 Author-Name: Scott A. Bruce Author-X-Name-First: Scott A. Author-X-Name-Last: Bruce Author-Name: Cheng Yong Tang Author-X-Name-First: Cheng Yong Author-X-Name-Last: Tang Author-Name: Martica H. Hall Author-X-Name-First: Martica H. Author-X-Name-Last: Hall Author-Name: Robert T. Krafty Author-X-Name-First: Robert T. Author-X-Name-Last: Krafty Title: Empirical Frequency Band Analysis of Nonstationary Time Series Abstract: The time-varying power spectrum of a time series process is a bivariate function that quantifies the magnitude of oscillations at different frequencies and times. To obtain low-dimensional, parsimonious measures from this functional parameter, applied researchers consider collapsed measures of power within local bands that partition the frequency space. Frequency bands commonly used in the scientific literature were historically derived, but they are not guaranteed to be optimal or justified for adequately summarizing information from a given time series process under current study. There is a dearth of methods for empirically constructing statistically optimal bands for a given signal. The goal of this article is to provide a standardized, unifying approach for deriving and analyzing customized frequency bands. A consistent, frequency-domain, iterative cumulative-sum-based scanning procedure is formulated to identify frequency bands that best preserve nonstationary information. A formal hypothesis testing procedure is also developed to test which, if any, frequency bands remain stationary. The proposed method is used to analyze heart rate variability of a patient during sleep and uncovers a refined partition of frequency bands that best summarize the time-varying power spectrum. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1933-1945 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1671199 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1671199 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1933-1945 Template-Type: ReDIF-Article 1.0 Author-Name: Ran Tao Author-X-Name-First: Ran Author-X-Name-Last: Tao Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: Dan-Yu Lin Author-X-Name-First: Dan-Yu Author-X-Name-Last: Lin Title: Optimal Designs of Two-Phase Studies Abstract: The two-phase design is a cost-effective sampling strategy to evaluate the effects of covariates on an outcome when certain covariates are too expensive to be measured on all study subjects. Under such a design, the outcome and inexpensive covariates are measured on all subjects in the first phase and the first-phase information is used to select subjects for measurements of expensive covariates in the second phase. Previous research on two-phase studies has focused largely on the inference procedures rather than the design aspects. We investigate the design efficiency of the two-phase study, as measured by the semiparametric efficiency bound for estimating the regression coefficients of expensive covariates. We consider general two-phase studies, where the outcome variable can be continuous, discrete, or censored, and the second-phase sampling can depend on the first-phase data in any manner. We develop optimal or approximately optimal two-phase designs, which can be substantially more efficient than the existing designs. We demonstrate the improvements of the new designs over the existing ones through extensive simulation studies and two large medical studies. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1946-1959 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1671200 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1671200 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1946-1959 Template-Type: ReDIF-Article 1.0 Author-Name: Dachuan Chen Author-X-Name-First: Dachuan Author-X-Name-Last: Chen Author-Name: Per A. Mykland Author-X-Name-First: Per A. Author-X-Name-Last: Mykland Author-Name: Lan Zhang Author-X-Name-First: Lan Author-X-Name-Last: Zhang Title: The Five Trolls Under the Bridge: Principal Component Analysis With Asynchronous and Noisy High Frequency Data Abstract: We develop a principal component analysis (PCA) for high frequency data. As in Northern fairy tales, there are trolls waiting for the explorer. The first three trolls are market microstructure noise, asynchronous sampling times, and edge effects in estimators. To get around these, a robust estimator of the spot covariance matrix is developed based on the smoothed two-scale realized variance (S-TSRV). The fourth troll is how to pass from the estimated time-varying covariance matrix to PCA. Under finite dimensionality, we develop this methodology through the estimation of realized spectral functions. Rates of convergence and central limit theory, as well as an estimator of standard error, are established. The fifth troll is high dimension on top of high frequency, where we also develop PCA. With the help of a new identity concerning the spot principal orthogonal complement, we study the high-dimensional rates of convergence after eliminating several strong assumptions of classical PCA. As an application, we show that our first principal component (PC) closely matches but potentially outperforms the S&P 100 market index.
From a statistical standpoint, the close match between the first PC and the market index also corroborates this PCA procedure and the underlying S-TSRV matrix, in the sense of Karl Popper. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1960-1977 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1672555 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1672555 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1960-1977 Template-Type: ReDIF-Article 1.0 Author-Name: Jing Lei Author-X-Name-First: Jing Author-X-Name-Last: Lei Title: Cross-Validation With Confidence Abstract: Cross-validation is one of the most popular model and tuning parameter selection methods in statistics and machine learning. Despite its wide applicability, traditional cross-validation methods tend to overfit because they ignore the uncertainty in the testing sample. We develop a novel statistically principled inference tool based on cross-validation that takes into account the uncertainty in the testing sample. This method outputs a set of highly competitive candidate models containing the optimal one with guaranteed probability. As a consequence, our method can achieve consistent variable selection in a classical linear regression setting, for which existing cross-validation methods require unconventional split ratios. When used for tuning parameter selection, the method can provide an alternative trade-off between prediction accuracy and model interpretability to that offered by existing variants of cross-validation. We demonstrate the performance of the proposed method in several simulated and real data examples. Supplemental materials for this article can be found online. Journal: Journal of the American Statistical Association Pages: 1978-1997 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1672556 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1672556 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1978-1997 Template-Type: ReDIF-Article 1.0 Author-Name: Minerva Mukhopadhyay Author-X-Name-First: Minerva Author-X-Name-Last: Mukhopadhyay Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: Targeted Random Projection for Prediction From High-Dimensional Features Abstract: We consider the problem of computationally efficient prediction with high-dimensional and highly correlated predictors when accurate variable selection is effectively impossible. Direct application of penalization or Bayesian methods implemented with Markov chain Monte Carlo can be computationally daunting and unstable. A common solution is first-stage dimension reduction through screening or projecting the design matrix onto a lower-dimensional hyperplane. Screening is highly sensitive to threshold choice, while projections often have poor performance in very high dimensions. We propose targeted random projection (TARP) to combine positive aspects of both strategies. TARP uses screening to order the inclusion probabilities of the features in the projection matrix used for dimension reduction, leading to data-informed sparsity. We provide theoretical support for a Bayesian predictive algorithm based on TARP, including statistical and computational complexity guarantees.
Examples for simulated and real data applications illustrate gains relative to a variety of competitors. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1998-2010 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1677240 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1677240 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1998-2010 Template-Type: ReDIF-Article 1.0 Author-Name: Yilin Chen Author-X-Name-First: Yilin Author-X-Name-Last: Chen Author-Name: Pengfei Li Author-X-Name-First: Pengfei Author-X-Name-Last: Li Author-Name: Changbao Wu Author-X-Name-First: Changbao Author-X-Name-Last: Wu Title: Doubly Robust Inference With Nonprobability Survey Samples Abstract: We establish a general framework for statistical inferences with nonprobability survey samples when relevant auxiliary information is available from a probability survey sample. We develop a rigorous procedure for estimating the propensity scores for units in the nonprobability sample, and construct doubly robust estimators for the finite population mean. Variance estimation is discussed under the proposed framework. Results from simulation studies show the robustness and the efficiency of our proposed estimators as compared to existing methods. The proposed method is used to analyze a nonprobability survey sample collected by the Pew Research Center with auxiliary information from the Behavioral Risk Factor Surveillance System and the Current Population Survey. Our results illustrate a general approach to inference with nonprobability samples and highlight the importance and usefulness of auxiliary information from probability survey samples. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2011-2021 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1677241 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1677241 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:2011-2021 Template-Type: ReDIF-Article 1.0 Author-Name: Jingfei Zhang Author-X-Name-First: Jingfei Author-X-Name-Last: Zhang Author-Name: Will Wei Sun Author-X-Name-First: Will Wei Author-X-Name-Last: Sun Author-Name: Lexin Li Author-X-Name-First: Lexin Author-X-Name-Last: Li Title: Mixed-Effect Time-Varying Network Model and Application in Brain Connectivity Analysis Abstract: Time-varying networks are fast emerging in a wide range of scientific and business applications. Most existing dynamic network models are limited to a single-subject and discrete-time setting. In this article, we propose a mixed-effect network model that characterizes the continuous time-varying behavior of the network at the population level, while taking into account both the individual subject variability and the prior module information. We develop a multistep optimization procedure for a constrained likelihood estimation and derive the associated asymptotic properties. We demonstrate the effectiveness of our method through both simulations and an application to a study of brain development in youth. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 2022-2036 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1677242 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1677242 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:2022-2036 Template-Type: ReDIF-Article 1.0 Author-Name: Jonathan R. Bradley Author-X-Name-First: Jonathan R. Author-X-Name-Last: Bradley Author-Name: Scott H. Holan Author-X-Name-First: Scott H. Author-X-Name-Last: Holan Author-Name: Christopher K. Wikle Author-X-Name-First: Christopher K. Author-X-Name-Last: Wikle Title: Bayesian Hierarchical Models With Conjugate Full-Conditional Distributions for Dependent Data From the Natural Exponential Family Abstract: We introduce a Bayesian approach for analyzing (possibly) high-dimensional dependent data that are distributed according to a member from the natural exponential family of distributions. This problem requires extensive methodological advancements, as jointly modeling high-dimensional dependent data leads to the so-called “big n problem.” The computational complexity of the “big n problem” is further exacerbated when allowing for non-Gaussian data models, as is the case here. Thus, we develop new computationally efficient distribution theory for this setting. In particular, we introduce the “conjugate multivariate distribution,” which is motivated by the Diaconis and Ylvisaker distribution. Furthermore, we provide substantial theoretical and methodological development, including results regarding conditional distributions, an asymptotic relationship with the multivariate normal distribution, conjugate prior distributions, and full-conditional distributions for a Gibbs sampler. To demonstrate the wide applicability of the proposed methodology, we provide two simulation studies and three applications based on an epidemiology dataset, a federal statistics dataset, and an environmental dataset, respectively. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2037-2052 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1677471 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1677471 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:2037-2052 Template-Type: ReDIF-Article 1.0 Author-Name: Trambak Banerjee Author-X-Name-First: Trambak Author-X-Name-Last: Banerjee Author-Name: Gourab Mukherjee Author-X-Name-First: Gourab Author-X-Name-Last: Mukherjee Author-Name: Wenguang Sun Author-X-Name-First: Wenguang Author-X-Name-Last: Sun Title: Adaptive Sparse Estimation With Side Information Abstract: The article considers the problem of estimating a high-dimensional sparse parameter in the presence of side information that encodes the sparsity structure. We develop a general framework that involves first using an auxiliary sequence to capture the side information, and then incorporating the auxiliary sequence in inference to reduce the estimation risk. The proposed method, which carries out adaptive Stein’s unbiased risk estimate-thresholding using side information (ASUS), is shown to have robust performance and enjoy optimality properties.
We develop new theories to characterize regimes in which ASUS far outperforms competitive shrinkage estimators, and establish precise conditions under which ASUS is asymptotically optimal. Simulation studies are conducted to show that ASUS substantially improves the performance of existing methods in many settings. The methodology is applied to the analysis of data from single-cell virology studies and microarray time course experiments. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2053-2067 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1679639 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1679639 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:2053-2067 Template-Type: ReDIF-Article 1.0 Author-Name: Kathleen T. Li Author-X-Name-First: Kathleen T. Author-X-Name-Last: Li Title: Statistical Inference for Average Treatment Effects Estimated by Synthetic Control Methods Abstract: The synthetic control (SC) method, a powerful tool for estimating average treatment effects (ATE), is increasingly popular in fields such as statistics, economics, political science, and marketing. The SC is particularly suitable for estimating ATE with a single (or a few) treated unit(s), a fixed number of control units, and large pre- and post-treatment periods (which we refer to as “long panels”). To date, there has been no formal inference theory for the SC ATE estimator with long panels under general conditions. Existing work mostly uses placebo tests for inference or some permutation methods when the post-treatment period is small. In this article, we derive the asymptotic distribution of the SC and modified synthetic control (MSC) ATE estimators using projection theory. We show that a properly designed subsampling method can be used to obtain confidence intervals and conduct inference whereas the standard bootstrap cannot. Simulations and an empirical application that examines the effect of opening a physical showroom by an e-tailer demonstrate the usefulness of the MSC method in applications. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2068-2083 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1686986 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1686986 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:2068-2083 Template-Type: ReDIF-Article 1.0 Author-Name: Junwei Lu Author-X-Name-First: Junwei Author-X-Name-Last: Lu Author-Name: Mladen Kolar Author-X-Name-First: Mladen Author-X-Name-Last: Kolar Author-Name: Han Liu Author-X-Name-First: Han Author-X-Name-Last: Liu Title: Kernel Meets Sieve: Post-Regularization Confidence Bands for Sparse Additive Model Abstract: We develop a novel procedure for constructing confidence bands for components of a sparse additive model. Our procedure is based on a new kernel-sieve hybrid estimator that combines the two most popular nonparametric estimation methods in the literature, kernel regression and the spline method, and is of interest in its own right. Existing methods for fitting the sparse additive model are primarily based on sieve estimators, while the literature on confidence bands for nonparametric models is primarily based on kernel or local polynomial estimators.
Our kernel-sieve hybrid estimator combines the best of both worlds and allows us to provide a simple procedure for constructing confidence bands in high-dimensional sparse additive models. We prove that the confidence bands are asymptotically honest by studying approximation with a Gaussian process. Thorough numerical results on both synthetic data and real-world neuroscience data are provided to demonstrate the efficacy of the theory. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2084-2099 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2019.1689984 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1689984 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:2084-2099 Template-Type: ReDIF-Article 1.0 Author-Name: Jordan J. Franks Author-X-Name-First: Jordan J. Author-X-Name-Last: Franks Title: Handbook of Approximate Bayesian Computation. Journal: Journal of the American Statistical Association Pages: 2100-2101 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2020.1846973 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1846973 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:2100-2101 Template-Type: ReDIF-Article 1.0 Author-Name: Yen-Chi Chen Author-X-Name-First: Yen-Chi Author-X-Name-Last: Chen Title: Handbook of Mixture Analysis. Journal: Journal of the American Statistical Association Pages: 2101-2102 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2020.1846974 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1846974 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:2101-2102 Template-Type: ReDIF-Article 1.0 Author-Name: Richard J. Cook Author-X-Name-First: Richard J. Author-X-Name-Last: Cook Title: The Statistical Analysis of Multivariate Failure Time Data: A Marginal Modeling Approach. Journal: Journal of the American Statistical Association Pages: 2102-2104 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2020.1846975 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1846975 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:2102-2104 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Editorial Collaborators Journal: Journal of the American Statistical Association Pages: 2105-2113 Issue: 532 Volume: 115 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2020.1846977 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1846977 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:2105-2113 Template-Type: ReDIF-Article 1.0 Author-Name: Andrew Gelman Author-X-Name-First: Andrew Author-X-Name-Last: Gelman Author-Name: Zaiying Huang Author-X-Name-First: Zaiying Author-X-Name-Last: Huang Title: Estimating Incumbency Advantage and Its Variation, as an Example of a Before–After Study Abstract: Incumbency advantage is one of the most widely studied features in American legislative elections. In this article we construct and implement an estimator that allows incumbency advantage to vary between individual incumbents.
This model predicts that open-seat elections will be less variable than those with incumbents running, an observed empirical pattern that is not explained by previous models. We apply our method to the U.S. House of Representatives in the twentieth century. Our estimate of the overall pattern of incumbency advantage over time is similar to previous estimates (although slightly lower), and we also find a pattern of increasing variation. More generally, our multilevel model represents a new method for estimating effects in before–after studies. Journal: Journal of the American Statistical Association Pages: 437-446 Issue: 482 Volume: 103 Year: 2008 Month: 9 X-DOI: 10.1198/016214507000000626 File-URL: http://hdl.handle.net/10.1198/016214507000000626 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:103:y:2008:i:482:p:437-446 Template-Type: ReDIF-Article 1.0 Author-Name: Zhuo Wang Author-X-Name-First: Zhuo Author-X-Name-Last: Wang Author-Name: Yujing Jiang Author-X-Name-First: Yujing Author-X-Name-Last: Jiang Author-Name: Hui Wan Author-X-Name-First: Hui Author-X-Name-Last: Wan Author-Name: Jun Yan Author-X-Name-First: Jun Author-X-Name-Last: Yan Author-Name: Xuebin Zhang Author-X-Name-First: Xuebin Author-X-Name-Last: Zhang Title: Toward Optimal Fingerprinting in Detection and Attribution of Changes in Climate Extremes Abstract: Detection and attribution of climate change plays a central role in establishing the causal relationship between the observed changes in the climate and their possible causes. Optimal fingerprinting has been widely used as a standard method for detection and attribution analysis for mean climate conditions, but there has been no satisfactory analog for climate extremes. Here, we turn an intuitive concept, which incorporates the expected climate responses to external forcings into the location parameters of the marginal generalized extreme value (GEV) distributions of the observed extremes, into a practical and better-understood method. Marginal approaches based on a weighted sum of marginal GEV score equations are promising because they do not require specifying the dependence structure. Their computational efficiency makes them feasible for handling multiple forcings simultaneously. The method under working independence is recommended because it produces robust results when there are errors in variables. Our analyses show human influences on temperature extremes at the subcontinental scale. Compared with previous studies, we detected human influences in a slightly smaller number of regions. This is possibly due to the under-coverage of the confidence intervals in existing works, suggesting the need for careful examinations of the properties of the statistical methods in practice. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1-13 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1730852 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1730852 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:1-13 Template-Type: ReDIF-Article 1.0 Author-Name: Seyoung Park Author-X-Name-First: Seyoung Author-X-Name-Last: Park Author-Name: Hao Xu Author-X-Name-First: Hao Author-X-Name-Last: Xu Author-Name: Hongyu Zhao Author-X-Name-First: Hongyu Author-X-Name-Last: Zhao Title: Integrating Multidimensional Data for Clustering Analysis With Applications to Cancer Patient Data Abstract: Advances in high-throughput genomic technologies coupled with large-scale studies including The Cancer Genome Atlas (TCGA) project have generated rich resources of diverse types of omics data to better understand cancer etiology and treatment responses. Clustering patients into subtypes with similar disease etiologies and/or treatment responses using multiple omics data types has the potential to improve the precision of clustering compared with using a single data type. However, in practice, patient clustering is still mostly based on a single type of omics data or ad hoc integration of clustering results from individual data types, leading to potential loss of information. By treating each omics data type as a different informative representation of the patients, we propose a novel multi-view spectral clustering framework to integrate different omics data types measured from the same subject. We learn the weight of each data type as well as a similarity measure between patients via a nonconvex optimization framework. We solve the proposed nonconvex problem iteratively using the ADMM algorithm and show the convergence of the algorithm. The accuracy and robustness of the proposed clustering method are studied both in theory and through various synthetic data. When our method is applied to the TCGA data, the patient clusters inferred by our method show more significant differences in survival times between clusters than those inferred from existing clustering methods. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 14-26 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1730853 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1730853 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:14-26 Template-Type: ReDIF-Article 1.0 Author-Name: Souhaib Ben Taieb Author-X-Name-First: Souhaib Ben Author-X-Name-Last: Taieb Author-Name: James W. Taylor Author-X-Name-First: James W. Author-X-Name-Last: Taylor Author-Name: Rob J. Hyndman Author-X-Name-First: Rob J. Author-X-Name-Last: Hyndman Title: Hierarchical Probabilistic Forecasting of Electricity Demand With Smart Meter Data Abstract: Decisions regarding the supply of electricity across a power grid must take into consideration the inherent uncertainty in demand. Optimal decision-making requires probabilistic forecasts for demand in a hierarchy with various levels of aggregation, such as substations, cities, and regions. The forecasts should be coherent in the sense that the forecast of the aggregated series should equal the sum of the forecasts of the corresponding disaggregated series. Coherency is essential, since the allocation of electricity at one level of the hierarchy relies on the appropriate amount being provided from the previous level.
We introduce a new probabilistic forecasting method for a large hierarchy based on UK residential smart meter data. We find our method provides coherent and accurate probabilistic forecasts, as a result of an effective forecast combination. Furthermore, by avoiding distributional assumptions, we find that our method captures the variety of distributions in the smart meter hierarchy. Finally, the results confirm that, to ensure coherency in our large-scale hierarchy, it is sufficient to model a set of lower-dimensional dependencies, rather than modeling the entire joint distribution of all series in the hierarchy. In achieving coherent and accurate hierarchical probabilistic forecasts, this work contributes to improved decision-making for smart grids. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 27-43 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1736081 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1736081 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:27-43 Template-Type: ReDIF-Article 1.0 Author-Name: Giovanni Nattino Author-X-Name-First: Giovanni Author-X-Name-Last: Nattino Author-Name: Bo Lu Author-X-Name-First: Bo Author-X-Name-Last: Lu Author-Name: Junxin Shi Author-X-Name-First: Junxin Author-X-Name-Last: Shi Author-Name: Stanley Lemeshow Author-X-Name-First: Stanley Author-X-Name-Last: Lemeshow Author-Name: Henry Xiang Author-X-Name-First: Henry Author-X-Name-Last: Xiang Title: Triplet Matching for Estimating Causal Effects With Three Treatment Arms: A Comparative Study of Mortality by Trauma Center Level Abstract: Comparing outcomes across different levels of trauma centers is vital in evaluating regionalized trauma care. With observational data, it is critical to adjust for patient characteristics to render valid causal comparisons. Propensity score matching is a popular method to infer causal relationships in observational studies with two treatment arms. Few studies, however, have used matching designs with more than two groups, due to the complexity of matching algorithms. We fill the gap by developing an iterative matching algorithm for the three-group setting. Our algorithm outperforms the nearest neighbor algorithm and is shown to produce matched samples with total distance no larger than twice the optimal distance. We implement the evidence factors method for binary outcomes, which includes a randomization-based testing strategy and a sensitivity analysis for hidden bias in three-group matched designs. We apply our method to the Nationwide Emergency Department Sample data to compare emergency department mortality among non-trauma, level I, and level II trauma centers. Our tests suggest that admission to a trauma center has a beneficial effect on mortality, assuming no unmeasured confounding. A sensitivity analysis for hidden bias shows that unmeasured confounders, moderately associated with the type of care received, may change the result qualitatively. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association Pages: 44-53 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1737078 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1737078 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:44-53 Template-Type: ReDIF-Article 1.0 Author-Name: Kevin Z. Lin Author-X-Name-First: Kevin Z. Author-X-Name-Last: Lin Author-Name: Han Liu Author-X-Name-First: Han Author-X-Name-Last: Liu Author-Name: Kathryn Roeder Author-X-Name-First: Kathryn Author-X-Name-Last: Roeder Title: Covariance-Based Sample Selection for Heterogeneous Data: Applications to Gene Expression and Autism Risk Gene Detection Abstract: Risk for autism can be influenced by genetic mutations in hundreds of genes. Based on findings showing that genes with highly correlated gene expressions are functionally interrelated, “guilt by association” methods such as DAWN have been developed to identify these autism risk genes. Previous research analyzes the BrainSpan dataset, which contains gene expression of brain tissues from varying regions and developmental periods. Since the spatiotemporal properties of brain tissue are known to affect the gene expression’s covariance, previous research has focused only on a specific subset of samples to avoid the issue of heterogeneity. This analysis leads to a potential loss of power when detecting risk genes. In this article, we develop a new method called covariance-based sample selection (COBS) to find a larger and more homogeneous subset of samples that share the same population covariance matrix for the downstream DAWN analysis. To demonstrate COBS’s effectiveness, we use genetic risk scores from two sequential data freezes obtained in 2014 and 2020. We show COBS improves DAWN’s ability to predict risk genes detected in the newer data freeze when using the risk scores of the older data freeze as input. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 54-67 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1738234 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1738234 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:54-67 Template-Type: ReDIF-Article 1.0 Author-Name: Lucy Xia Author-X-Name-First: Lucy Author-X-Name-Last: Xia Author-Name: Richard Zhao Author-X-Name-First: Richard Author-X-Name-Last: Zhao Author-Name: Yanhui Wu Author-X-Name-First: Yanhui Author-X-Name-Last: Wu Author-Name: Xin Tong Author-X-Name-First: Xin Author-X-Name-Last: Tong Title: Intentional Control of Type I Error Over Unconscious Data Distortion: A Neyman–Pearson Approach to Text Classification Abstract: This article addresses the challenges in classifying textual data obtained from open online platforms, which are vulnerable to distortion. Most existing classification methods minimize the overall classification error and may yield an undesirably large Type I error (relevant textual messages are classified as irrelevant), particularly when available data exhibit an asymmetry between relevant and irrelevant information. Data distortion exacerbates this situation and often leads to fallacious prediction.
To deal with inestimable data distortion, we propose the use of the Neyman–Pearson (NP) classification paradigm, which minimizes Type II error under a user-specified Type I error constraint. Theoretically, we show that the NP oracle is unaffected by data distortion when the class conditional distributions remain the same. Empirically, we study a case of classifying posts about worker strikes obtained from a leading Chinese microblogging platform, which are frequently prone to extensive, unpredictable and inestimable censorship. We demonstrate that, even though the training and test data are susceptible to different distortion and therefore potentially follow different distributions, our proposed NP methods control the Type I error on test data at the targeted level. The methods and implementation pipeline proposed in our case study are applicable to many other problems involving data distortion. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 68-81 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1740711 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1740711 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:68-81 Template-Type: ReDIF-Article 1.0 Author-Name: Bikram Karmakar Author-X-Name-First: Bikram Author-X-Name-Last: Karmakar Author-Name: Dylan S. Small Author-X-Name-First: Dylan S. Author-X-Name-Last: Small Author-Name: Paul R. Rosenbaum Author-X-Name-First: Paul R. Author-X-Name-Last: Rosenbaum Title: Reinforced Designs: Multiple Instruments Plus Control Groups as Evidence Factors in an Observational Study of the Effectiveness of Catholic Schools Abstract: Absent randomization, causal conclusions gain strength if several independent evidence factors concur. We develop a method for constructing evidence factors from several instruments plus a direct comparison of treated and control groups, and we evaluate the method’s performance in terms of design sensitivity and simulation. In the application, we consider the effectiveness of Catholic versus public high schools, constructing three evidence factors from three past strategies for studying this question, namely: (i) having nearby access to a Catholic school as an instrument, (ii) being Catholic as an instrument for attending Catholic school, and (iii) a direct comparison of students in Catholic and public high schools. Although these three analyses use the same data, we: (i) construct three essentially independent statistical tests of no effect that require very different assumptions, (ii) study the sensitivity of each test to the assumptions underlying that test, (iii) examine the degree to which independent tests dependent upon different assumptions concur, (iv) pool evidence across independent factors. In the application, we conclude that the ostensible benefit of Catholic education depends critically on the validity of one instrument, and is therefore quite fragile. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
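[Editorial illustration.] For step (iv), one textbook way to pool evidence across (approximately) independent factors is Fisher's combination of the factor-specific p-values; the sketch below uses hypothetical p-values and is a generic illustration, not the pooling procedure used in the article.

    import math
    from scipy import stats

    # Hypothetical p-values from three evidence factors, e.g., two
    # instruments plus a direct treated-control comparison.
    p_values = [0.04, 0.11, 0.20]

    # Fisher's combination: -2 * sum(log p) is chi-squared with 2k degrees
    # of freedom under the global null when the factors are independent.
    chi2_stat = -2 * sum(math.log(p) for p in p_values)
    pooled_p = stats.chi2.sf(chi2_stat, df=2 * len(p_values))
    print(f"pooled p-value: {pooled_p:.4f}")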
Journal: Journal of the American Statistical Association Pages: 82-92 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1745811 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1745811 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:82-92 Template-Type: ReDIF-Article 1.0 Author-Name: Gregory P. Bopp Author-X-Name-First: Gregory P. Author-X-Name-Last: Bopp Author-Name: Benjamin A. Shaby Author-X-Name-First: Benjamin A. Author-X-Name-Last: Shaby Author-Name: Raphaël Huser Author-X-Name-First: Raphaël Author-X-Name-Last: Huser Title: A Hierarchical Max-Infinitely Divisible Spatial Model for Extreme Precipitation Abstract: Understanding the spatial extent of extreme precipitation is necessary for determining flood risk and adequately designing infrastructure (e.g., stormwater pipes) to withstand such hazards. While environmental phenomena typically exhibit weakening spatial dependence at increasingly extreme levels, limiting max-stable process models for block maxima have a rigid dependence structure that does not capture this type of behavior. We propose a flexible Bayesian model from a broader family of (conditionally) max-infinitely divisible processes that allows for weakening spatial dependence at increasingly extreme levels, and due to a hierarchical representation of the likelihood in terms of random effects, our inference approach scales to large datasets. Therefore, our model not only has a flexible dependence structure, but it also allows for fast, fully Bayesian inference, prediction and conditional simulation in high dimensions. The proposed model is constructed using flexible random basis functions that are estimated from the data, allowing for straightforward inspection of the predominant spatial patterns of extremes. In addition, the described process possesses (conditional) max-stability as a special case, making inference on the tail dependence class possible. We apply our model to extreme precipitation in North-Eastern America, and show that the proposed model adequately captures the extremal behavior of the data. Interestingly, we find that the principal modes of spatial variation estimated from our model resemble observed patterns in extreme precipitation events occurring along the coast (e.g., with localized tropical cyclones and convective storms) and mountain range borders. Our model, which can easily be adapted to other types of environmental datasets, is therefore useful to identify extreme weather patterns and regions at risk. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 93-106 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1750414 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1750414 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:93-106 Template-Type: ReDIF-Article 1.0 Author-Name: R. Glennie Author-X-Name-First: R. Author-X-Name-Last: Glennie Author-Name: S. T. Buckland Author-X-Name-First: S. T. Author-X-Name-Last: Buckland Author-Name: R. Langrock Author-X-Name-First: R. Author-X-Name-Last: Langrock Author-Name: T. Gerrodette Author-X-Name-First: T. Author-X-Name-Last: Gerrodette Author-Name: L. T. Ballance Author-X-Name-First: L. T. 
Author-X-Name-Last: Ballance Author-Name: S. J. Chivers Author-X-Name-First: S. J. Author-X-Name-Last: Chivers Author-Name: M. D. Scott Author-X-Name-First: M. D. Author-X-Name-Last: Scott Title: Incorporating Animal Movement Into Distance Sampling Abstract: Distance sampling is a popular statistical method to estimate the density of wild animal populations. Conventional distance sampling represents animals as fixed points in space that are detected with an unknown probability that depends on the distance between the observer and the animal. Animal movement can cause substantial bias in density estimation. Methods to correct for responsive animal movement exist, but none account for nonresponsive movement independent of the observer. Here, an explicit animal movement model is incorporated into distance sampling, combining distance sampling survey data with animal telemetry data. Detection probability depends on the entire unobserved path the animal travels. The intractable integration over all possible animal paths is approximated by a hidden Markov model. A simulation study shows the method to be negligibly biased (<5%) in scenarios where conventional distance sampling overestimates abundance by up to 100%. The method is applied to line transect surveys (1999–2006) of spotted dolphins (Stenella attenuata) in the eastern tropical Pacific where abundance is shown to be positively biased by 21% on average, which can have substantial impact on the population dynamics estimated from these abundance estimates and on the choice of statistical methodology applied to future surveys. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 107-115 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1764362 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1764362 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:107-115 Template-Type: ReDIF-Article 1.0 Author-Name: Decai Liang Author-X-Name-First: Decai Author-X-Name-Last: Liang Author-Name: Haozhe Zhang Author-X-Name-First: Haozhe Author-X-Name-Last: Zhang Author-Name: Xiaohui Chang Author-X-Name-First: Xiaohui Author-X-Name-Last: Chang Author-Name: Hui Huang Author-X-Name-First: Hui Author-X-Name-Last: Huang Title: Modeling and Regionalization of China’s PM2.5 Using Spatial-Functional Mixture Models Abstract: Severe air pollution affects billions of people around the world, particularly in developing countries such as China. Effective emission control policies rely primarily on a proper assessment of air pollutants and accurate spatial clustering outcomes. Unfortunately, emission patterns are difficult to observe as they are highly confounded by many meteorological and geographical factors. In this study, we propose a novel approach for modeling and clustering PM2.5 concentrations across China. We model observed concentrations from monitoring stations as spatially dependent functional data and assume latent emission processes originate from a functional mixture model with each component as a spatio-temporal process. Cluster memberships of monitoring stations are modeled as a Markov random field, in which confounding effects are controlled through energy functions.
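[Editorial illustration.] The role of the energy functions can be pictured with a generic Potts-type Markov random field on cluster labels; the station layout and the simple label-disagreement energy below are hypothetical stand-ins for the article's confounding-adjusted energy functions.

    import numpy as np

    rng = np.random.default_rng(2)

    # Hypothetical monitoring stations: one cluster label per station and a
    # list of spatially neighboring station pairs.
    labels = rng.integers(0, 3, size=6)
    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5)]

    def potts_energy(labels, edges, beta=1.0):
        # Generic Potts-type energy: each neighboring pair with differing
        # labels adds beta, so low-energy label maps are spatially smooth.
        return beta * sum(labels[i] != labels[j] for i, j in edges)

    print("labels:", labels, "energy:", potts_energy(labels, edges))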
The superior performance of our approach is demonstrated using extensive simulation studies. Our method is effective in dividing China and the Beijing-Tianjin-Hebei region into several regions based on PM2.5 concentrations, suggesting that separate local emission control policies are needed. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 116-132 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1764363 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1764363 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:116-132 Template-Type: ReDIF-Article 1.0 Author-Name: Ting-Huei Chen Author-X-Name-First: Ting-Huei Author-X-Name-Last: Chen Author-Name: Nilanjan Chatterjee Author-X-Name-First: Nilanjan Author-X-Name-Last: Chatterjee Author-Name: Maria Teresa Landi Author-X-Name-First: Maria Teresa Author-X-Name-Last: Landi Author-Name: Jianxin Shi Author-X-Name-First: Jianxin Author-X-Name-Last: Shi Title: A Penalized Regression Framework for Building Polygenic Risk Models Based on Summary Statistics From Genome-Wide Association Studies and Incorporating External Information Abstract: Large-scale genome-wide association studies (GWAS) provide opportunities for developing genetic risk prediction models that have the potential to improve disease prevention, intervention or treatment. The key step is to develop polygenic risk score (PRS) models with high predictive performance for a given disease, which typically requires a large training dataset for selecting truly associated single nucleotide polymorphisms (SNPs) and estimating effect sizes accurately. Here, we develop a comprehensive penalized regression framework for fitting l1-regularized regression models to GWAS summary statistics. We propose incorporating pleiotropy and annotation information into PRS (PANPRS) development through suitable formulation of penalty functions and associated tuning parameters. Extensive simulations show that PANPRS performs equally well or better than existing PRS methods when no functional annotation or pleiotropy is incorporated. When functional annotation data and pleiotropy are informative, PANPRS substantially outperforms existing PRS methods in simulations. Finally, we applied our methods to build PRS for type 2 diabetes and melanoma and found that incorporating relevant functional annotations and GWAS of genetically related traits improved prediction of these two complex diseases. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 133-143 Issue: 533 Volume: 116 Year: 2020 Month: 10 X-DOI: 10.1080/01621459.2020.1764849 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1764849 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2020:i:533:p:133-143 Template-Type: ReDIF-Article 1.0 Author-Name: Long Feng Author-X-Name-First: Long Author-X-Name-Last: Feng Author-Name: Xuan Bi Author-X-Name-First: Xuan Author-X-Name-Last: Bi Author-Name: Heping Zhang Author-X-Name-First: Heping Author-X-Name-Last: Zhang Title: Brain Regions Identified as Being Associated With Verbal Reasoning Through the Use of Imaging Regression via Internal Variation Abstract: Brain-imaging data have been increasingly used to understand intellectual disabilities. Despite significant progress in biomedical research, the mechanisms for most of the intellectual disabilities remain unknown. Finding the underlying neurological mechanisms has proved difficult, especially in children due to the rapid development of their brains. We investigate verbal reasoning, which is a reliable measure of an individual’s general intellectual abilities, and develop a class of high-order imaging regression models to identify brain subregions which might be associated with this specific intellectual ability. A key novelty of our method is to take advantage of spatial brain structures, and specifically the piecewise smooth nature of most imaging coefficients in the form of high-order tensors. Our approach provides an effective and urgently needed method for identifying brain subregions potentially underlying certain intellectual disabilities. The idea behind our approach is a carefully constructed concept called internal variation (IV). The IV employs tensor decomposition and provides a computationally feasible substitution for total variation, which has been considered suitable to deal with similar problems but may not be scalable to high-order tensor regression. Before applying our method to analyze the real data, we conduct comprehensive simulation studies to demonstrate the validity of our method in imaging signal identification. Next, we present our results from the analysis of a dataset based on the Philadelphia Neurodevelopmental Cohort for which we preprocessed the data including reorienting, bias-field correcting, extracting, normalizing, and registering the magnetic resonance images from 978 individuals. Our analysis identified a subregion across the cingulate cortex and the corpus callosum as being associated with individuals’ verbal reasoning ability, which, to the best of our knowledge, is a novel region that has not been reported in the literature. This finding is useful in further investigation of functional mechanisms for verbal reasoning. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 144-158 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1766468 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1766468 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:144-158 Template-Type: ReDIF-Article 1.0 Author-Name: Michael R. Kosorok Author-X-Name-First: Michael R. Author-X-Name-Last: Kosorok Author-Name: Eric B. Laber Author-X-Name-First: Eric B. Author-X-Name-Last: Laber Author-Name: Dylan S. Small Author-X-Name-First: Dylan S.
Author-X-Name-Last: Small Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Title: Introduction to the Theory and Methods Special Issue on Precision Medicine and Individualized Policy Discovery Abstract: We introduce the Theory and Methods Special Issue on Precision Medicine and Individualized Policy Discovery. The issue consists of four discussion papers, grouped into two pairs, and sixteen regular research papers that cover many important lines of research on data-driven decision making. We hope that the many provocative and original ideas presented herein will inspire further work and development in precision medicine and personalization. Journal: Journal of the American Statistical Association Pages: 159-161 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1863224 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1863224 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:159-161 Template-Type: ReDIF-Article 1.0 Author-Name: Yifan Cui Author-X-Name-First: Yifan Author-X-Name-Last: Cui Author-Name: Eric Tchetgen Tchetgen Author-X-Name-First: Eric Author-X-Name-Last: Tchetgen Tchetgen Title: A Semiparametric Instrumental Variable Approach to Optimal Treatment Regimes Under Endogeneity Abstract: There is a fast-growing literature on estimating optimal treatment regimes based on randomized trials or observational studies under a key identifying condition of no unmeasured confounding. Because confounding by unmeasured factors cannot generally be ruled out with certainty in observational studies or randomized trials subject to noncompliance, we propose a general instrumental variable (IV) approach to learning optimal treatment regimes under endogeneity. Specifically, we establish identification of both the value function E[Y^{D(L)}] for a given regime D and optimal regimes arg max_D E[Y^{D(L)}] with the aid of a binary IV, when the assumption of no unmeasured confounding fails to hold. We also construct novel multiply robust classification-based estimators. Furthermore, we propose to identify and estimate optimal treatment regimes among those who would comply with the assigned treatment under a monotonicity assumption. In this latter case, we establish the somewhat surprising result that complier optimal regimes can be consistently estimated without directly collecting compliance information and therefore without the complier average treatment effect itself being identified. Our approach is illustrated via extensive simulation studies and a data application on the effect of child rearing on labor participation. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 162-173 Issue: 533 Volume: 116 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2020.1783272 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1783272 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2020:i:533:p:162-173 Template-Type: ReDIF-Article 1.0 Author-Name: Hongxiang Qiu Author-X-Name-First: Hongxiang Author-X-Name-Last: Qiu Author-Name: Marco Carone Author-X-Name-First: Marco Author-X-Name-Last: Carone Author-Name: Ekaterina Sadikova Author-X-Name-First: Ekaterina Author-X-Name-Last: Sadikova Author-Name: Maria Petukhova Author-X-Name-First: Maria Author-X-Name-Last: Petukhova Author-Name: Ronald C. Kessler Author-X-Name-First: Ronald C.
Author-X-Name-Last: Kessler Author-Name: Alex Luedtke Author-X-Name-First: Alex Author-X-Name-Last: Luedtke Title: Optimal Individualized Decision Rules Using Instrumental Variable Methods Abstract: There is an extensive literature on the estimation and evaluation of optimal individualized treatment rules in settings where all confounders of the effect of treatment on outcome are observed. We study the development of individualized decision rules in settings where some of these confounders may not have been measured but a valid binary instrument is available for a binary treatment. We first consider individualized treatment rules, which will naturally be most interesting in settings where it is feasible to intervene directly on treatment. We then consider a setting where intervening on treatment is infeasible, but intervening to encourage treatment is feasible. In both of these settings, we also handle the case that the treatment is a limited resource so that optimal interventions focus the available resources on those individuals who will benefit most from treatment. Given a reference rule, we evaluate an optimal individualized rule by its average causal effect relative to a prespecified reference rule. We develop methods to estimate optimal individualized rules and construct asymptotically efficient plug-in estimators of the corresponding average causal effect relative to a prespecified reference rule. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 174-191 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1745814 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1745814 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:174-191 Template-Type: ReDIF-Article 1.0 Author-Name: Sukjin Han Author-X-Name-First: Sukjin Author-X-Name-Last: Han Title: Comment: Individualized Treatment Rules Under Endogeneity Journal: Journal of the American Statistical Association Pages: 192-195 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1831923 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1831923 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:192-195 Template-Type: ReDIF-Article 1.0 Author-Name: Bo Zhang Author-X-Name-First: Bo Author-X-Name-Last: Zhang Author-Name: Hongming Pu Author-X-Name-First: Hongming Author-X-Name-Last: Pu Title: Discussion of Cui and Tchetgen Tchetgen (2020) and Qiu et al. (2020) Journal: Journal of the American Statistical Association Pages: 196-199 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1832500 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1832500 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:196-199 Template-Type: ReDIF-Article 1.0 Author-Name: Yifan Cui Author-X-Name-First: Yifan Author-X-Name-Last: Cui Author-Name: Eric Tchetgen Tchetgen Author-X-Name-First: Eric Author-X-Name-Last: Tchetgen Tchetgen Title: Machine Intelligence for Individualized Decision Making Under a Counterfactual World: A Rejoinder Journal: Journal of the American Statistical Association Pages: 200-206 Issue: 533 Volume: 116 Year: 2021 Month: 2 X-DOI: 10.1080/01621459.2021.1872580 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1872580 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:200-206 Template-Type: ReDIF-Article 1.0 Author-Name: Hongxiang Qiu Author-X-Name-First: Hongxiang Author-X-Name-Last: Qiu Author-Name: Marco Carone Author-X-Name-First: Marco Author-X-Name-Last: Carone Author-Name: Ekaterina Sadikova Author-X-Name-First: Ekaterina Author-X-Name-Last: Sadikova Author-Name: Maria Petukhova Author-X-Name-First: Maria Author-X-Name-Last: Petukhova Author-Name: Ronald C. Kessler Author-X-Name-First: Ronald C. Author-X-Name-Last: Kessler Author-Name: Alex Luedtke Author-X-Name-First: Alex Author-X-Name-Last: Luedtke Title: Rejoinder: Optimal Individualized Decision Rules Using Instrumental Variable Methods Journal: Journal of the American Statistical Association Pages: 207-209 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1865166 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1865166 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:207-209 Template-Type: ReDIF-Article 1.0 Author-Name: Jared D. Huling Author-X-Name-First: Jared D. Author-X-Name-Last: Huling Author-Name: Maureen A. Smith Author-X-Name-First: Maureen A. Author-X-Name-Last: Smith Author-Name: Guanhua Chen Author-X-Name-First: Guanhua Author-X-Name-Last: Chen Title: A Two-Part Framework for Estimating Individualized Treatment Rules From Semicontinuous Outcomes Abstract: Health care payments are an important component of health care utilization and are thus a major focus in health services and health policy applications. However, payment outcomes are semicontinuous in that over a given period of time some patients incur no payments and some patients incur large costs. Individualized treatment rules (ITRs) are a major part of the push for tailoring treatments and interventions to patients, yet there is little work focused on estimating ITRs from semicontinuous outcomes. In this article, we develop a framework for estimation of ITRs based on two-part modeling, wherein the ITR is estimated by separately targeting the zero part of the outcome and the strictly positive part. To improve performance when high-dimensional covariates are available, we leverage a scientifically plausible penalty that simultaneously selects variables and encourages the signs of coefficients for each variable to agree between the two components of the ITR. We develop an efficient algorithm for computation and prove oracle inequalities for the resulting estimation and prediction errors. We demonstrate the effectiveness of our approach in simulated examples and in a study of a health system intervention. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 210-223 Issue: 533 Volume: 116 Year: 2020 Month: 10 X-DOI: 10.1080/01621459.2020.1801449 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1801449 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2020:i:533:p:210-223 Template-Type: ReDIF-Article 1.0 Author-Name: Lin Liu Author-X-Name-First: Lin Author-X-Name-Last: Liu Author-Name: Zach Shahn Author-X-Name-First: Zach Author-X-Name-Last: Shahn Author-Name: James M. Robins Author-X-Name-First: James M. Author-X-Name-Last: Robins Author-Name: Andrea Rotnitzky Author-X-Name-First: Andrea Author-X-Name-Last: Rotnitzky Title: Efficient Estimation of Optimal Regimes Under a No Direct Effect Assumption Abstract: We derive new estimators of an optimal joint testing and treatment regime under the no direct effect (NDE) assumption that a given laboratory, diagnostic, or screening test has no effect on a patient’s clinical outcomes except through the effect of the test results on the choice of treatment. We model the optimal joint strategy with an optimal structural nested mean model (opt-SNMM). The proposed estimators are more efficient than previous estimators of the parameters of an opt-SNMM because they efficiently leverage the “NDE of testing” assumption. Our methods will be of importance to decision scientists who either perform cost-benefit analyses or are tasked with the estimation of the “value of information” supplied by an expensive diagnostic test (such as an MRI to screen for lung cancer). Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 224-239 Issue: 533 Volume: 116 Year: 2021 Month: 1 X-DOI: 10.1080/01621459.2020.1856117 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1856117 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:224-239 Template-Type: ReDIF-Article 1.0 Author-Name: Haoyu Chen Author-X-Name-First: Haoyu Author-X-Name-Last: Chen Author-Name: Wenbin Lu Author-X-Name-First: Wenbin Author-X-Name-Last: Lu Author-Name: Rui Song Author-X-Name-First: Rui Author-X-Name-Last: Song Title: Statistical Inference for Online Decision Making: In a Contextual Bandit Setting Abstract: The online decision-making problem requires us to make a sequence of decisions based on incremental information. Common solutions often need to learn a reward model of different actions given the contextual information and then maximize the long-term reward. It is meaningful to know if the posited model is reasonable and how the model performs in the asymptotic sense. We study this problem under the setup of the contextual bandit framework with a linear reward model. The ε-greedy policy is adopted to address the classic exploration-and-exploitation dilemma. Using the martingale central limit theorem, we show that the online ordinary least squares estimator of model parameters is asymptotically normal. When the linear model is misspecified, we propose the online weighted least squares estimator using the inverse propensity score weighting and also establish its asymptotic normality. Based on the properties of the parameter estimators, we further show that the in-sample inverse propensity weighted value estimator is asymptotically normal. We illustrate our results using simulations and an application to a news article recommendation dataset from Yahoo!.
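[Editorial illustration.] The setup lends itself to a compact simulation. The following is a minimal caricature of an ε-greedy contextual bandit with a linear reward model fit by online ordinary least squares; the dimensions, arm count, and data-generating model are hypothetical, and the article's actual contribution, the asymptotic inference for these estimators, is not shown.

    import numpy as np

    rng = np.random.default_rng(3)
    d, n_arms, eps, T = 3, 2, 0.1, 5000
    true_beta = rng.normal(size=(n_arms, d))         # hypothetical reward model

    # Per-arm sufficient statistics for online ordinary least squares.
    XtX = [1e-6 * np.eye(d) for _ in range(n_arms)]  # tiny ridge for invertibility
    Xty = [np.zeros(d) for _ in range(n_arms)]

    for t in range(T):
        x = rng.normal(size=d)                                   # context
        est = [np.linalg.solve(XtX[a], Xty[a]) for a in range(n_arms)]
        if rng.random() < eps:                                   # explore
            a = int(rng.integers(n_arms))
        else:                                                    # exploit
            a = int(np.argmax([x @ b for b in est]))
        r = x @ true_beta[a] + rng.normal()                      # observed reward
        XtX[a] += np.outer(x, x)                                 # online update
        Xty[a] += r * x

    for a in range(n_arms):
        print("arm", a, "beta_hat:", np.round(np.linalg.solve(XtX[a], Xty[a]), 2))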
Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 240-255 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1770098 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1770098 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:240-255 Template-Type: ReDIF-Article 1.0 Author-Name: Juliana Schulz Author-X-Name-First: Juliana Author-X-Name-Last: Schulz Author-Name: Erica E. M. Moodie Author-X-Name-First: Erica E. M. Author-X-Name-Last: Moodie Title: Doubly Robust Estimation of Optimal Dosing Strategies Abstract: The goal of precision medicine is to tailor treatment strategies on an individual patient level. Although several estimation techniques have been developed for determining optimal treatment rules, the majority of methods focus on the case of a dichotomous treatment, an example being the dynamic weighted ordinary least squares regression approach of Wallace and Moodie. We propose an extension to the aforementioned framework to allow for a continuous treatment with the ultimate goal of estimating optimal dosing strategies. The proposed method is shown to be doubly robust against model misspecification whenever the implemented weights satisfy a particular balancing condition. A broad class of weight functions can be derived from the balancing condition, providing a flexible regression based estimation method in the context of adaptive treatment strategies for continuous valued treatments. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 256-268 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1753521 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1753521 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:256-268 Template-Type: ReDIF-Article 1.0 Author-Name: Yuan Chen Author-X-Name-First: Yuan Author-X-Name-Last: Chen Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: Yuanjia Wang Author-X-Name-First: Yuanjia Author-X-Name-Last: Wang Title: Learning Individualized Treatment Rules for Multiple-Domain Latent Outcomes Abstract: For many mental disorders, latent mental status from multiple-domain psychological or clinical symptoms may perform as a better characterization of the underlying disorder status than a simple summary score of the symptoms, and they may also serve as more reliable and representative features to differentiate treatment responses. Therefore, to address the complexity and heterogeneity of treatment responses for mental disorders, we provide a new paradigm for learning optimal individualized treatment rules (ITRs) by modeling patients’ latent mental status. We first learn the multi-domain latent states at baseline from the observed symptoms under a restricted Boltzmann machine (RBM) model, which encodes patients’ heterogeneous symptoms using an economical number of latent variables and yet remains flexible. We then optimize a value function defined by the latent states after treatment by exploiting a transformation of the observed symptoms based on the RBM without modeling the relationship between the latent mental states before and after treatment. The optimal treatment rules are derived using a weighted large margin classifier. 
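[Editorial illustration.] A rough sense of the final step, learning a rule with a weighted margin-based classifier, can be had from this sketch of outcome-weighted learning with a logistic surrogate loss on simulated, hypothetical data; note the article weights by values built from the RBM latent states, not by raw clipped outcomes as done here.

    import numpy as np

    rng = np.random.default_rng(4)
    n, d = 500, 3
    X = rng.normal(size=(n, d))
    A = rng.integers(0, 2, size=n) * 2 - 1           # treatments coded -1/+1
    Y = 1 + X[:, 0] * A + rng.normal(size=n)         # hypothetical outcomes

    # Outcome-based weights divided by the known propensity P(A|X) = 0.5.
    W = np.clip(Y, 0, None) / 0.5

    w = np.zeros(d)                                  # linear rule: treat if X @ w > 0
    for _ in range(2000):                            # weighted logistic surrogate loss
        margin = A * (X @ w)
        grad = -(X * (A * W / (1 + np.exp(margin)))[:, None]).mean(axis=0)
        w -= 0.1 * grad
    print("estimated rule coefficients:", np.round(w, 2))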
We derive the convergence rate of the proposed estimator under the latent models. Simulation studies are conducted to test the performance of the proposed method. Finally, we apply the developed method to real world studies and we demonstrate the utility and advantage of our method in tailoring treatments for patients with major depression, and identify patient subgroups informative for treatment recommendations. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 269-282 Issue: 533 Volume: 116 Year: 2020 Month: 10 X-DOI: 10.1080/01621459.2020.1817751 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1817751 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2020:i:533:p:269-282 Template-Type: ReDIF-Article 1.0 Author-Name: Yinghao Pan Author-X-Name-First: Yinghao Author-X-Name-Last: Pan Author-Name: Ying-Qi Zhao Author-X-Name-First: Ying-Qi Author-X-Name-Last: Zhao Title: Improved Doubly Robust Estimation in Learning Optimal Individualized Treatment Rules Abstract: Individualized treatment rules (ITRs) recommend treatment according to patient characteristics. There is a growing interest in developing novel and efficient statistical methods in constructing ITRs. We propose an improved doubly robust estimator of the optimal ITRs. The proposed estimator is based on a direct optimization of an augmented inverse-probability weighted estimator of the expected clinical outcome over a class of ITRs. The method enjoys two key properties. First, it is doubly robust, meaning that the proposed estimator is consistent when either the propensity score or the outcome model is correct. Second, it achieves the smallest variance among the class of doubly robust estimators when the propensity score model is correctly specified, regardless of the specification of the outcome model. Simulation studies show that the estimated ITRs obtained from our method yield better results than those obtained from current popular methods. Data from the Sequenced Treatment Alternatives to Relieve Depression study is analyzed as an illustrative example. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 283-294 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1725522 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1725522 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:283-294 Template-Type: ReDIF-Article 1.0 Author-Name: Bo Zhang Author-X-Name-First: Bo Author-X-Name-Last: Zhang Author-Name: Jordan Weiss Author-X-Name-First: Jordan Author-X-Name-Last: Weiss Author-Name: Dylan S. Small Author-X-Name-First: Dylan S. Author-X-Name-Last: Small Author-Name: Qingyuan Zhao Author-X-Name-First: Qingyuan Author-X-Name-Last: Zhao Title: Selecting and Ranking Individualized Treatment Rules With Unmeasured Confounding Abstract: It is common to compare individualized treatment rules based on the value function, which is the expected potential outcome under the treatment rule. Although the value function is not point-identified when there is unmeasured confounding, it still defines a partial order among the treatment rules under Rosenbaum’s sensitivity analysis model. 
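[Editorial illustration.] The value function being compared can be written down directly. Below is a minimal sketch of an augmented inverse-probability-weighted estimate of the value of one candidate rule under a known randomization probability; all data are simulated and hypothetical, and the article's contribution, the partial order and ranking under Rosenbaum's sensitivity model, is beyond this sketch.

    import numpy as np

    rng = np.random.default_rng(5)
    n = 2000
    X = rng.normal(size=n)
    pi = 0.5                                  # known randomization probability
    A = rng.binomial(1, pi, size=n)
    Y = X * A + rng.normal(size=n)            # hypothetical outcomes

    d = (X > 0).astype(int)                   # candidate rule d(X): treat if X > 0
    mu = X * d                                # outcome-model prediction E[Y | X, A=d(X)]
                                              # (here the truth, for illustration)

    # Augmented inverse-probability-weighted estimate of the value V(d).
    p_d = np.where(d == 1, pi, 1 - pi)        # P(A = d(X) | X)
    v_hat = np.mean((A == d) / p_d * (Y - mu) + mu)
    print("estimated value of rule d:", round(float(v_hat), 3))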
We first consider how to compare two treatment rules with unmeasured confounding in the single-decision setting and then use this pairwise test to rank multiple treatment rules. Among many treatment rules, we consider how to select the best rules and how to select the rules that are better than a control rule. The proposed methods are illustrated using two real examples, one about the benefit of malaria prevention programs to different age groups and another about the effect of late retirement on senior health in different gender and occupation groups. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 295-308 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1736083 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1736083 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:295-308 Template-Type: ReDIF-Article 1.0 Author-Name: Wenchuan Guo Author-X-Name-First: Wenchuan Author-X-Name-Last: Guo Author-Name: Xiao-Hua Zhou Author-X-Name-First: Xiao-Hua Author-X-Name-Last: Zhou Author-Name: Shujie Ma Author-X-Name-First: Shujie Author-X-Name-Last: Ma Title: Estimation of Optimal Individualized Treatment Rules Using a Covariate-Specific Treatment Effect Curve With High-Dimensional Covariates Abstract: With a large number of baseline covariates, we propose a new semiparametric modeling strategy for heterogeneous treatment effect estimation and individualized treatment selection, which are two major goals in personalized medicine. We achieve the first goal through estimating a covariate-specific treatment effect (CSTE) curve modeled as an unknown function of a weighted linear combination of all baseline covariates. The weight or the coefficient for each covariate is estimated by fitting a sparse semiparametric logistic single-index coefficient model. The CSTE curve is estimated by a spline-backfitted kernel procedure, which enables us to further construct a simultaneous confidence band (SCB) for the CSTE curve under a desired confidence level. Based on the SCB, we find the subgroups of patients that benefit from each treatment, so that we can make individualized treatment selection. The innovations of the proposed method are threefold. First, the proposed method can quantify variability associated with the estimated optimal individualized treatment rule with high-dimensional covariates. Second, the proposed method is flexible enough to depict both local and global associations between the treatment and baseline covariates in the presence of high-dimensional covariates, and thus it enjoys flexibility while achieving dimensionality reduction. Third, the SCB achieves the nominal confidence level asymptotically, and it provides a uniform inferential tool in making individualized treatment decisions. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 309-321 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1865167 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1865167 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:309-321 Template-Type: ReDIF-Article 1.0 Author-Name: Ruitao Lin Author-X-Name-First: Ruitao Author-X-Name-Last: Lin Author-Name: Peter F. Thall Author-X-Name-First: Peter F.
Author-X-Name-Last: Thall Author-Name: Ying Yuan Author-X-Name-First: Ying Author-X-Name-Last: Yuan Title: BAGS: A Bayesian Adaptive Group Sequential Trial Design With Subgroup-Specific Survival Comparisons Abstract: A Bayesian group sequential design is proposed that performs survival comparisons within patient subgroups in randomized trials where treatment–subgroup interactions may be present. A latent subgroup membership variable is assumed to allow the design to adaptively combine homogeneous subgroups, or split heterogeneous subgroups, to improve the procedure’s within-subgroup power. If a baseline covariate related to survival is available, the design may incorporate this information to improve subgroup identification while basing the comparative test on the average hazard ratio. General guidelines are provided for calibrating prior hyperparameters and design parameters to control the overall Type I error rate and optimize performance. Simulations show that the design is robust under a wide variety of different scenarios. When two or more subgroups are truly homogeneous but differ from the other subgroups, the proposed method is substantially more powerful than tests that either ignore subgroups or conduct a separate test within each subgroup. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 322-334 Issue: 533 Volume: 116 Year: 2020 Month: 11 X-DOI: 10.1080/01621459.2020.1837142 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1837142 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2020:i:533:p:322-334 Template-Type: ReDIF-Article 1.0 Author-Name: Steve Yadlowsky Author-X-Name-First: Steve Author-X-Name-Last: Yadlowsky Author-Name: Fabio Pellegrini Author-X-Name-First: Fabio Author-X-Name-Last: Pellegrini Author-Name: Federica Lionetto Author-X-Name-First: Federica Author-X-Name-Last: Lionetto Author-Name: Stefan Braune Author-X-Name-First: Stefan Author-X-Name-Last: Braune Author-Name: Lu Tian Author-X-Name-First: Lu Author-X-Name-Last: Tian Title: Estimation and Validation of Ratio-based Conditional Average Treatment Effects Using Observational Data Abstract: While sample sizes in randomized clinical trials are large enough to estimate the average treatment effect well, they are often insufficient for estimation of treatment-covariate interactions critical to studying data-driven precision medicine. Observational data from real world practice may play an important role in alleviating this problem. One common approach in trials is to predict the outcome of interest with separate regression models in each treatment arm, and estimate the treatment effect based on the contrast of the predictions. Unfortunately, this simple approach may induce spurious treatment-covariate interaction in observational studies when the regression model is misspecified. Motivated by the need of modeling the number of relapses in multiple sclerosis (MS) patients, where the ratio of relapse rates is a natural choice of the treatment effect, we propose to estimate the conditional average treatment effect (CATE) as the ratio of expected potential outcomes, and derive a doubly robust estimator of this CATE in a semiparametric model of treatment-covariate interactions. We also provide a validation procedure to check the quality of the estimator on an independent sample. 
We conduct simulations to demonstrate the finite sample performance of the proposed methods, and illustrate their advantages on real data by examining the treatment effect of dimethyl fumarate compared to teriflunomide in MS patients. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 335-352 Issue: 533 Volume: 116 Year: 2020 Month: 7 X-DOI: 10.1080/01621459.2020.1772080 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1772080 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2020:i:533:p:335-352 Template-Type: ReDIF-Article 1.0 Author-Name: Xinran Li Author-X-Name-First: Xinran Author-X-Name-Last: Li Author-Name: Xiao-Li Meng Author-X-Name-First: Xiao-Li Author-X-Name-Last: Meng Title: A Multi-resolution Theory for Approximating Infinite-p-Zero-n: Transitional Inference, Individualized Predictions, and a World Without Bias-Variance Tradeoff Abstract: Transitional inference is an empiricism concept, rooted and practiced in clinical medicine since ancient Greece. Knowledge and experiences gained from treating one entity (e.g., a disease or a group of patients) are applied to treat a related but distinctively different one (e.g., a similar disease or a new patient). This notion of “transition to the similar” renders individualized treatments an operational meaning, yet its theoretical foundation defies the familiar inductive inference framework. The uniqueness of entities is the result of potentially an infinite number of attributes (hence p=∞), which entails zero direct training sample size (i.e., n = 0) because genuine guinea pigs do not exist. However, the literature on wavelets and on sieve methods for nonparametric estimation suggests a principled approximation theory for transitional inference via a multi-resolution (MR) perspective, where we use the resolution level to index the degree of approximation to ultimate individuality. MR inference seeks a primary resolution indexing an indirect training sample, which provides enough matched attributes to increase the relevance of the results to the target individuals and yet still accumulate sufficient indirect sample sizes for robust estimation. Theoretically, MR inference relies on an infinite-term ANOVA-type decomposition, providing an alternative way to model sparsity via the decay rate of the resolution bias as a function of the primary resolution level. Unexpectedly, this decomposition reveals a world without variance when the outcome is a deterministic function of potentially infinitely many predictors. In this deterministic world, the optimal resolution prefers over-fitting in the traditional sense when the resolution bias decays sufficiently rapidly. Furthermore, there can be many “descents” in the prediction error curve, when the contributions of predictors are inhomogeneous and the ordering of their importance does not align with the order of their inclusion in prediction. These findings may hint at a deterministic approximation theory for understanding the apparently over-fitting resistant phenomenon of some over-saturated models in machine learning. Journal: Journal of the American Statistical Association Pages: 353-367 Issue: 533 Volume: 116 Year: 2020 Month: 12 X-DOI: 10.1080/01621459.2020.1844210 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1844210 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:116:y:2020:i:533:p:353-367 Template-Type: ReDIF-Article 1.0 Author-Name: Ashkan Ertefaie Author-X-Name-First: Ashkan Author-X-Name-Last: Ertefaie Author-Name: James R. McKay Author-X-Name-First: James R. Author-X-Name-Last: McKay Author-Name: David Oslin Author-X-Name-First: David Author-X-Name-Last: Oslin Author-Name: Robert L. Strawderman Author-X-Name-First: Robert L. Author-X-Name-Last: Strawderman Title: Robust Q-Learning Abstract: Q-learning is a regression-based approach that is widely used to formalize the development of an optimal dynamic treatment strategy. Finite dimensional working models are typically used to estimate certain nuisance parameters, and misspecification of these working models can result in residual confounding and/or efficiency loss. We propose a robust Q-learning approach which allows estimating such nuisance parameters using data-adaptive techniques. We study the asymptotic behavior of our estimators and provide simulation studies that highlight the need for and usefulness of the proposed method in practice. We use the data from the “Extending Treatment Effectiveness of Naltrexone” multistage randomized trial to illustrate our proposed methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 368-381 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1753522 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1753522 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:368-381 Template-Type: ReDIF-Article 1.0 Author-Name: Peng Liao Author-X-Name-First: Peng Author-X-Name-Last: Liao Author-Name: Predrag Klasnja Author-X-Name-First: Predrag Author-X-Name-Last: Klasnja Author-Name: Susan Murphy Author-X-Name-First: Susan Author-X-Name-Last: Murphy Title: Off-Policy Estimation of Long-Term Average Outcomes With Applications to Mobile Health Abstract: Due to the recent advancements in wearables and sensing technology, health scientists are increasingly developing mobile health (mHealth) interventions. In mHealth interventions, mobile devices are used to deliver treatment to individuals as they go about their daily lives. These treatments are generally designed to impact a near time, proximal outcome such as stress or physical activity. The mHealth intervention policies, often called just-in-time adaptive interventions, are decision rules that map an individual’s current state (e.g., individual’s past behaviors as well as current observations of time, location, social activity, stress, and urges to smoke) to a particular treatment at each of many time points. The vast majority of current mHealth interventions deploy expert-derived policies. In this article, we provide an approach for conducting inference about the performance of one or more such policies using historical data collected under a possibly different policy. Our measure of performance is the average of proximal outcomes over a long time period should the particular mHealth policy be followed. We provide an estimator as well as confidence intervals. This work is motivated by HeartSteps, an mHealth physical activity intervention. Supplementary materials for this article are available online.
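[Editorial illustration.] To fix ideas, here is a single-time-point caricature of off-policy evaluation via importance weighting on simulated, hypothetical data; the article's estimator targets the long-term average outcome of an mHealth policy and is substantially more involved than this one-step re-weighting.

    import numpy as np

    rng = np.random.default_rng(6)
    T = 10_000
    state = rng.normal(size=T)                        # hypothetical mHealth states

    # Probability of sending a prompt under the data-collection (behavior)
    # policy and under the target policy to be evaluated.
    p_behavior = 1.0 / (1.0 + np.exp(-state))
    p_target = np.full(T, 0.3)

    action = rng.binomial(1, p_behavior)              # logged treatments
    reward = action * state + rng.normal(size=T)      # proximal outcomes

    # Importance weights re-weight logged outcomes to the target policy.
    w = np.where(action == 1, p_target / p_behavior,
                 (1 - p_target) / (1 - p_behavior))
    print("estimated average proximal outcome:", round(float(np.mean(w * reward)), 3))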
Journal: Journal of the American Statistical Association Pages: 382-391 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1807993 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1807993 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:382-391 Template-Type: ReDIF-Article 1.0 Author-Name: Xinkun Nie Author-X-Name-First: Xinkun Author-X-Name-Last: Nie Author-Name: Emma Brunskill Author-X-Name-First: Emma Author-X-Name-Last: Brunskill Author-Name: Stefan Wager Author-X-Name-First: Stefan Author-X-Name-Last: Wager Title: Learning When-to-Treat Policies Abstract: Many applied decision-making problems have a dynamic component: The policymaker needs not only to choose whom to treat, but also when to start which treatment. For example, a medical doctor may choose between postponing treatment (watchful waiting) and prescribing one of several available treatments during the many visits from a patient. We develop an “advantage doubly robust” estimator for learning such dynamic treatment rules using observational data under the assumption of sequential ignorability. We prove welfare regret bounds that generalize results for doubly robust learning in the single-step setting, and show promising empirical performance in several different contexts. Our approach is practical for policy optimization, and does not need any structural (e.g., Markovian) assumptions. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 392-409 Issue: 533 Volume: 116 Year: 2020 Month: 11 X-DOI: 10.1080/01621459.2020.1831925 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1831925 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2020:i:533:p:392-409 Template-Type: ReDIF-Article 1.0 Author-Name: Xinyu Hu Author-X-Name-First: Xinyu Author-X-Name-Last: Hu Author-Name: Min Qian Author-X-Name-First: Min Author-X-Name-Last: Qian Author-Name: Bin Cheng Author-X-Name-First: Bin Author-X-Name-Last: Cheng Author-Name: Ying Kuen Cheung Author-X-Name-First: Ying Kuen Author-X-Name-Last: Cheung Title: Personalized Policy Learning Using Longitudinal Mobile Health Data Abstract: Personalized policy represents a paradigm shift from one decision rule for all users to an individualized decision rule for each user. Developing personalized policy in mobile health applications poses challenges. First, for lack of adherence, data from each user are limited. Second, unmeasured contextual factors can potentially impact decision making. Aiming to optimize immediate rewards, we propose using a generalized linear mixed modeling framework where population features and individual features are modeled as fixed and random effects, respectively, and synthesized to form the personalized policy. The group lasso type penalty is imposed to avoid overfitting of individual deviations from the population model. We examine the conditions under which the proposed method works in the presence of time-varying endogenous covariates, and provide conditional optimality and marginal consistency results of the expected immediate outcome under the estimated policies. We apply our method to develop personalized push (“prompt”) schedules in 294 app users, with the goal to maximize the prompt response rate given past app usage and other contextual factors.
In a simulation study, the proposed method compares favorably to existing estimation methods, including the R function “glmer.” Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 410-420 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2020.1785476 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1785476 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:410-420 Template-Type: ReDIF-Article 1.0 Author-Name: Yilun Sun Author-X-Name-First: Yilun Author-X-Name-Last: Sun Author-Name: Lu Wang Author-X-Name-First: Lu Author-X-Name-Last: Wang Title: Stochastic Tree Search for Estimating Optimal Dynamic Treatment Regimes Abstract: A dynamic treatment regime (DTR) is a sequence of decision rules that adapt to the time-varying states of an individual. Black-box learning methods have shown great potential in predicting the optimal treatments; however, the resulting DTRs lack interpretability, which is of paramount importance for medical experts to understand and implement. We present a stochastic tree-based reinforcement learning (ST-RL) method for estimating optimal DTRs in a multistage multitreatment setting with data from either randomized trials or observational studies. At each stage, ST-RL constructs a decision tree by first modeling the mean of counterfactual outcomes via nonparametric regression models, and then stochastically searching for the optimal tree-structured decision rule using a Markov chain Monte Carlo algorithm. We implement the proposed method in a backward inductive fashion through multiple decision stages. The proposed ST-RL delivers optimal DTRs with better interpretability and contributes to the existing literature in its non-greedy policy search. Additionally, ST-RL demonstrates stable and outstanding performance even with a large number of covariates, which is especially appealing when data are from large observational studies. We illustrate the performance of ST-RL through simulation studies, and also a real data application using esophageal cancer data collected from 1170 patients at MD Anderson Cancer Center from 1998 to 2012. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 421-432 Issue: 533 Volume: 116 Year: 2021 Month: 1 X-DOI: 10.1080/01621459.2020.1819294 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1819294 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:421-432 Template-Type: ReDIF-Article 1.0 Author-Name: Christopher Nemeth Author-X-Name-First: Christopher Author-X-Name-Last: Nemeth Author-Name: Paul Fearnhead Author-X-Name-First: Paul Author-X-Name-Last: Fearnhead Title: Stochastic Gradient Markov Chain Monte Carlo Abstract: Markov chain Monte Carlo (MCMC) algorithms are generally regarded as the gold standard technique for Bayesian inference. They are theoretically well-understood and conceptually simple to apply in practice. The drawback of MCMC is that performing exact inference generally requires all of the data to be processed at each iteration of the algorithm. For large datasets, the computational cost of MCMC can be prohibitive, which has led to recent developments in scalable Monte Carlo algorithms that have a significantly lower computational cost than standard MCMC.
In this article, we focus on a particular class of scalable Monte Carlo algorithms, stochastic gradient Markov chain Monte Carlo (SGMCMC), which utilizes data subsampling techniques to reduce the per-iteration cost of MCMC. We provide an introduction to some popular SGMCMC algorithms and review the supporting theoretical results, as well as comparing the efficiency of SGMCMC algorithms against MCMC on benchmark examples. The supporting R code is available online at https://github.com/chris-nemeth/sgmcmc-review-paper. Journal: Journal of the American Statistical Association Pages: 433-450 Issue: 533 Volume: 116 Year: 2021 Month: 1 X-DOI: 10.1080/01621459.2020.1847120 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1847120 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:433-450 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Handbook of Spatial Epidemiology Journal: Journal of the American Statistical Association Pages: 451-453 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2021.1880230 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1880230 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:451-453 Template-Type: ReDIF-Article 1.0 Author-Name: Grace S. Chiu Author-X-Name-First: Grace S. Author-X-Name-Last: Chiu Title: Handbook of Environmental and Ecological Statistics. Journal: Journal of the American Statistical Association Pages: 453-455 Issue: 533 Volume: 116 Year: 2021 Month: 3 X-DOI: 10.1080/01621459.2021.1880232 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1880232 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:453-455
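The Nemeth and Fearnhead SGMCMC review above lends itself to a minimal worked example. The sketch below implements stochastic gradient Langevin dynamics, one popular SGMCMC algorithm, for a toy Gaussian mean model; the data, prior, step size, and batch size are illustrative choices, not taken from the article or its accompanying R code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N observations from N(theta_true, 1).
N, theta_true = 10_000, 2.0
x = rng.normal(theta_true, 1.0, size=N)

def grad_log_post(theta, batch):
    # N(0, 10^2) prior; minibatch likelihood gradient rescaled by N/|batch|.
    prior = -theta / 100.0
    lik = (N / len(batch)) * np.sum(batch - theta)
    return prior + lik

theta, batch_size, eps, samples = 0.0, 100, 1e-5, []
for _ in range(5_000):
    batch = x[rng.integers(0, N, size=batch_size)]
    # SGLD update: half-step gradient move plus injected Gaussian noise.
    theta += 0.5 * eps * grad_log_post(theta, batch) + np.sqrt(eps) * rng.normal()
    samples.append(theta)

print(np.mean(samples[1_000:]))  # posterior mean, close to theta_true
```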
Template-Type: ReDIF-Article 1.0 Author-Name: Yuan Chen Author-X-Name-First: Yuan Author-X-Name-Last: Chen Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: Yuanjia Wang Author-X-Name-First: Yuanjia Author-X-Name-Last: Wang Title: Learning Individualized Treatment Rules for Multiple-Domain Latent Outcomes Abstract: For many mental disorders, latent mental status from multiple-domain psychological or clinical symptoms may perform as a better characterization of the underlying disorder status than a simple summary score of the symptoms, and they may also serve as more reliable and representative features to differentiate treatment responses. Therefore, to address the complexity and heterogeneity of treatment responses for mental disorders, we provide a new paradigm for learning optimal individualized treatment rules (ITRs) by modeling patients’ latent mental status. We first learn the multi-domain latent states at baseline from the observed symptoms under a restricted Boltzmann machine (RBM) model, which encodes patients’ heterogeneous symptoms using an economical number of latent variables and yet remains flexible.
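As context for the restricted Boltzmann machine mentioned in the Chen, Zeng, and Wang abstract, a standard binary RBM takes the following form; this is the textbook formulation, and the article's exact specification for mixed-scale symptoms may differ.

```latex
% Standard binary RBM over observed symptoms v and latent states h:
\[
  P(v, h) \;\propto\; \exp\{-E(v, h)\},
  \qquad
  E(v, h) \;=\; -\,a^{\top} v \;-\; b^{\top} h \;-\; v^{\top} W h,
\]
% where a and b are bias vectors, W is a weight matrix, and the latent
% layer h plays the role of the multiple-domain latent mental status.
```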
We then optimize a value function defined by the latent states after treatment by exploiting a transformation of the observed symptoms based on the RBM without modeling the relationship between the latent mental states before and after treatment. The optimal treatment rules are derived using a weighted large margin classifier. We derive the convergence rate of the proposed estimator under the latent models. Simulation studies are conducted to test the performance of the proposed method. Finally, we apply the developed method to real-world studies, demonstrate its utility and advantage in tailoring treatments for patients with major depression, and identify patient subgroups informative for treatment recommendations. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 269-282 Issue: 533 Volume: 116 Year: 2021 Month: 1 X-DOI: 10.1080/01621459.2020.1817751 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1817751 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:269-282 Template-Type: ReDIF-Article 1.0 Author-Name: Ruitao Lin Author-X-Name-First: Ruitao Author-X-Name-Last: Lin Author-Name: Peter F. Thall Author-X-Name-First: Peter F. Author-X-Name-Last: Thall Author-Name: Ying Yuan Author-X-Name-First: Ying Author-X-Name-Last: Yuan Title: BAGS: A Bayesian Adaptive Group Sequential Trial Design With Subgroup-Specific Survival Comparisons Abstract: A Bayesian group sequential design is proposed that performs survival comparisons within patient subgroups in randomized trials where treatment–subgroup interactions may be present. A latent subgroup membership variable is assumed to allow the design to adaptively combine homogeneous subgroups, or split heterogeneous subgroups, to improve the procedure’s within-subgroup power. If a baseline covariate related to survival is available, the design may incorporate this information to improve subgroup identification while basing the comparative test on the average hazard ratio. General guidelines are provided for calibrating prior hyperparameters and design parameters to control the overall Type I error rate and optimize performance. Simulations show that the design is robust under a wide variety of different scenarios. When two or more subgroups are truly homogeneous but differ from the other subgroups, the proposed method is substantially more powerful than tests that either ignore subgroups or conduct a separate test within each subgroup. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 322-334 Issue: 533 Volume: 116 Year: 2021 Month: 1 X-DOI: 10.1080/01621459.2020.1837142 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1837142 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:322-334 Template-Type: ReDIF-Article 1.0 Author-Name: Jared D. Huling Author-X-Name-First: Jared D. Author-X-Name-Last: Huling Author-Name: Maureen A.
Author-X-Name-Last: Smith Author-Name: Guanhua Chen Author-X-Name-First: Guanhua Author-X-Name-Last: Chen Title: A Two-Part Framework for Estimating Individualized Treatment Rules From Semicontinuous Outcomes Abstract: Health care payments are an important component of health care utilization and are thus a major focus in health services and health policy applications. However, payment outcomes are semicontinuous in that over a given period of time some patients incur no payments and some patients incur large costs. Individualized treatment rules (ITRs) are a major part of the push for tailoring treatments and interventions to patients, yet there is little work focused on estimating ITRs from semicontinuous outcomes. In this article, we develop a framework for estimation of ITRs based on two-part modeling, wherein the ITR is estimated by separately targeting the zero part of the outcome and the strictly positive part. To improve performance when high-dimensional covariates are available, we leverage a scientifically plausible penalty that simultaneously selects variables and encourages the signs of coefficients for each variable to agree between the two components of the ITR. We develop an efficient algorithm for computation and prove oracle inequalities for the resulting estimation and prediction errors. We demonstrate the effectiveness of our approach in simulated examples and in a study of a health system intervention. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 210-223 Issue: 533 Volume: 116 Year: 2021 Month: 1 X-DOI: 10.1080/01621459.2020.1801449 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1801449 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:210-223 Template-Type: ReDIF-Article 1.0 Author-Name: Ting-Huei Chen Author-X-Name-First: Ting-Huei Author-X-Name-Last: Chen Author-Name: Nilanjan Chatterjee Author-X-Name-First: Nilanjan Author-X-Name-Last: Chatterjee Author-Name: Maria Teresa Landi Author-X-Name-First: Maria Teresa Author-X-Name-Last: Landi Author-Name: Jianxin Shi Author-X-Name-First: Jianxin Author-X-Name-Last: Shi Title: A Penalized Regression Framework for Building Polygenic Risk Models Based on Summary Statistics From Genome-Wide Association Studies and Incorporating External Information Abstract: Large-scale genome-wide association studies (GWAS) provide opportunities for developing genetic risk prediction models that have the potential to improve disease prevention, intervention or treatment. The key step is to develop polygenic risk score (PRS) models with high predictive performance for a given disease, which typically requires a large training dataset for selecting truly associated single nucleotide polymorphisms (SNPs) and estimating effect sizes accurately. Here, we develop a comprehensive penalized regression framework for fitting ℓ1-regularized regression models to GWAS summary statistics. We propose incorporating pleiotropy and annotation information into PRS (PANPRS) development through suitable formulation of penalty functions and associated tuning parameters. Extensive simulations show that PANPRS performs equally well or better than existing PRS methods when no functional annotation or pleiotropy is incorporated. When functional annotation data and pleiotropy are informative, PANPRS substantially outperforms existing PRS methods in simulations.
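The ℓ1-regularized fit to GWAS summary statistics described in the PANPRS abstract can be sketched, in its simplest unannotated form, along the lines of the standard summary-statistic lasso; the objective below, written with an LD (correlation) matrix R and standardized marginal effects r, is an illustrative baseline rather than the PANPRS objective itself, which additionally encodes pleiotropy and annotation information through its penalty.

```latex
% Summary-statistic lasso baseline (illustrative; PANPRS augments the
% penalty with annotation- and pleiotropy-aware terms):
\[
  \hat{\beta} \;=\; \arg\min_{\beta}\;
  \tfrac{1}{2}\,\beta^{\top} R\, \beta \;-\; r^{\top} \beta
  \;+\; \lambda \lVert \beta \rVert_1 ,
\]
% where r holds standardized marginal SNP effects from the GWAS summary
% data and R is a linkage-disequilibrium matrix from a reference panel.
```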
Finally, we applied our methods to build PRS for type 2 diabetes and melanoma and found that incorporating relevant functional annotations and GWAS of genetically related traits improved prediction of these two complex diseases. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 133-143 Issue: 533 Volume: 116 Year: 2021 Month: 1 X-DOI: 10.1080/01621459.2020.1764849 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1764849 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:133-143 Template-Type: ReDIF-Article 1.0 Author-Name: Yifan Cui Author-X-Name-First: Yifan Author-X-Name-Last: Cui Author-Name: Eric Tchetgen Tchetgen Author-X-Name-First: Eric Author-X-Name-Last: Tchetgen Tchetgen Title: A Semiparametric Instrumental Variable Approach to Optimal Treatment Regimes Under Endogeneity Abstract: There is a fast-growing literature on estimating optimal treatment regimes based on randomized trials or observational studies under a key identifying condition of no unmeasured confounding. Because confounding by unmeasured factors cannot generally be ruled out with certainty in observational studies or randomized trials subject to noncompliance, we propose a general instrumental variable (IV) approach to learning optimal treatment regimes under endogeneity. Specifically, we establish identification of both the value function E[Y_D(L)] for a given regime D and optimal regimes argmax_D E[Y_D(L)] with the aid of a binary IV, when the assumption of no unmeasured confounding fails to hold. We also construct novel multiply robust classification-based estimators. Furthermore, we propose to identify and estimate optimal treatment regimes among those who would comply with the assigned treatment under a monotonicity assumption. In this latter case, we establish the somewhat surprising result that complier optimal regimes can be consistently estimated without directly collecting compliance information and therefore without the complier average treatment effect itself being identified. Our approach is illustrated via extensive simulation studies and a data application on the effect of child rearing on labor participation. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 162-173 Issue: 533 Volume: 116 Year: 2021 Month: 1 X-DOI: 10.1080/01621459.2020.1783272 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1783272 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:162-173 Template-Type: ReDIF-Article 1.0 Author-Name: Steve Yadlowsky Author-X-Name-First: Steve Author-X-Name-Last: Yadlowsky Author-Name: Fabio Pellegrini Author-X-Name-First: Fabio Author-X-Name-Last: Pellegrini Author-Name: Federica Lionetto Author-X-Name-First: Federica Author-X-Name-Last: Lionetto Author-Name: Stefan Braune Author-X-Name-First: Stefan Author-X-Name-Last: Braune Author-Name: Lu Tian Author-X-Name-First: Lu Author-X-Name-Last: Tian Title: Estimation and Validation of Ratio-based Conditional Average Treatment Effects Using Observational Data Abstract: While sample sizes in randomized clinical trials are large enough to estimate the average treatment effect well, they are often insufficient for estimation of treatment-covariate interactions critical to studying data-driven precision medicine. Observational data from real world practice may play an important role in alleviating this problem. One common approach in trials is to predict the outcome of interest with separate regression models in each treatment arm, and estimate the treatment effect based on the contrast of the predictions. Unfortunately, this simple approach may induce spurious treatment-covariate interaction in observational studies when the regression model is misspecified. Motivated by the need to model the number of relapses in multiple sclerosis (MS) patients, where the ratio of relapse rates is a natural choice of the treatment effect, we propose to estimate the conditional average treatment effect (CATE) as the ratio of expected potential outcomes, and derive a doubly robust estimator of this CATE in a semiparametric model of treatment-covariate interactions. We also provide a validation procedure to check the quality of the estimator on an independent sample. We conduct simulations to demonstrate the finite sample performance of the proposed methods, and illustrate their advantages on real data by examining the treatment effect of dimethyl fumarate compared to teriflunomide in MS patients. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 335-352 Issue: 533 Volume: 116 Year: 2021 Month: 1 X-DOI: 10.1080/01621459.2020.1772080 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1772080 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:335-352 Template-Type: ReDIF-Article 1.0 Author-Name: Xinran Li Author-X-Name-First: Xinran Author-X-Name-Last: Li Author-Name: Xiao-Li Meng Author-X-Name-First: Xiao-Li Author-X-Name-Last: Meng Title: A Multi-resolution Theory for Approximating Infinite-p-Zero-n: Transitional Inference, Individualized Predictions, and a World Without Bias-Variance Tradeoff Abstract: Transitional inference is an empiricism concept, rooted and practiced in clinical medicine since ancient Greece. Knowledge and experiences gained from treating one entity (e.g., a disease or a group of patients) are applied to treat a related but distinctively different one (e.g., a similar disease or a new patient). This notion of “transition to the similar” renders individualized treatments an operational meaning, yet its theoretical foundation defies the familiar inductive inference framework. The uniqueness of entities is the result of potentially an infinite number of attributes (hence p=∞), which entails zero direct training sample size (i.e., n = 0) because genuine guinea pigs do not exist.
However, the literature on wavelets and on sieve methods for nonparametric estimation suggests a principled approximation theory for transitional inference via a multi-resolution (MR) perspective, where we use the resolution level to index the degree of approximation to ultimate individuality. MR inference seeks a primary resolution indexing an indirect training sample, which provides enough matched attributes to increase the relevance of the results to the target individuals and yet still accumulate sufficient indirect sample sizes for robust estimation. Theoretically, MR inference relies on an infinite-term ANOVA-type decomposition, providing an alternative way to model sparsity via the decay rate of the resolution bias as a function of the primary resolution level. Unexpectedly, this decomposition reveals a world without variance when the outcome is a deterministic function of potentially infinitely many predictors. In this deterministic world, the optimal resolution prefers over-fitting in the traditional sense when the resolution bias decays sufficiently rapidly. Furthermore, there can be many “descents” in the prediction error curve, when the contributions of predictors are inhomogeneous and the ordering of their importance does not align with the order of their inclusion in prediction. These findings may hint at a deterministic approximation theory for understanding the apparently over-fitting resistant phenomenon of some over-saturated models in machine learning. Journal: Journal of the American Statistical Association Pages: 353-367 Issue: 533 Volume: 116 Year: 2021 Month: 1 X-DOI: 10.1080/01621459.2020.1844210 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1844210 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:353-367 Template-Type: ReDIF-Article 1.0 Author-Name: Shonosuke Sugasawa Author-X-Name-First: Shonosuke Author-X-Name-Last: Sugasawa Title: Grouped Heterogeneous Mixture Modeling for Clustered Data Abstract: Clustered data are ubiquitous in a variety of scientific fields. In this article, we propose a flexible and interpretable modeling approach, called grouped heterogeneous mixture modeling, for clustered data, which models cluster-wise conditional distributions by mixtures of latent conditional distributions common to all the clusters. In the model, we assume that clusters are divided into a finite number of groups and mixing proportions are the same within the same group. We provide a simple generalized EM algorithm for computing the maximum likelihood estimator, and an information criterion to select the numbers of groups and latent distributions. We also propose structured grouping strategies by introducing penalties on grouping parameters in the likelihood function. Under the settings where both the number of clusters and cluster sizes tend to infinity, we present asymptotic properties of the maximum likelihood estimator and the information criterion. We demonstrate the proposed method through simulation studies and an application to crime risk modeling in Tokyo. Journal: Journal of the American Statistical Association Pages: 999-1010 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1777136 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1777136 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:999-1010 Template-Type: ReDIF-Article 1.0 Author-Name: Nathan Kallus Author-X-Name-First: Nathan Author-X-Name-Last: Kallus Title: Rejoinder: New Objectives for Policy Learning Journal: Journal of the American Statistical Association Pages: 694-698 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1866580 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1866580 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:694-698 Template-Type: ReDIF-Article 1.0 Author-Name: Jared S. Murray Author-X-Name-First: Jared S. Author-X-Name-Last: Murray Title: Log-Linear Bayesian Additive Regression Trees for Multinomial Logistic and Count Regression Models Abstract: We introduce Bayesian additive regression trees (BART) for log-linear models including multinomial logistic regression and count regression with zero-inflation and overdispersion. BART has been applied to nonparametric mean regression and binary classification problems in a range of settings. However, existing applications of BART have been mostly limited to models for Gaussian “data,” either observed or latent. This is primarily because efficient MCMC algorithms are available for Gaussian likelihoods. But while many useful models are naturally cast in terms of latent Gaussian variables, many others are not—including models considered in this article. We develop new data augmentation strategies and carefully specified prior distributions for these new models. Like the original BART prior, the new prior distributions are carefully constructed and calibrated to be flexible while guarding against overfitting. Together the new priors and data augmentation schemes allow us to implement an efficient MCMC sampler outside the context of Gaussian models. The utility of these new methods is illustrated with examples and an application to a previously published dataset. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 756-769 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1813587 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1813587 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:756-769 Template-Type: ReDIF-Article 1.0 Author-Name: Srinjoy Das Author-X-Name-First: Srinjoy Author-X-Name-Last: Das Author-Name: Dimitris N. Politis Author-X-Name-First: Dimitris N. Author-X-Name-Last: Politis Title: Predictive Inference for Locally Stationary Time Series With an Application to Climate Data Abstract: The model-free prediction principle of Politis has been successfully applied to general regression problems, as well as problems involving stationary time series. However, with long time series, for example, annual temperature measurements spanning over 100 years or daily financial returns spanning several years, it may be unrealistic to assume stationarity throughout the span of the dataset. In this article, we show how model-free prediction can be applied to handle time series that are only locally stationary, that is, they can be assumed to be stationary only over short time-windows. 
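The premise of the Das and Politis entry, that a locally stationary series is approximately stationary over short time-windows, can be illustrated with a simple rolling-window one-step-ahead predictor; the AR(1)-on-a-trailing-window scheme below is a deliberately crude stand-in for the article's model-free method, and all choices (toy data, window length) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy locally stationary series: AR(1) whose coefficient drifts over time.
n = 1_000
phi = np.linspace(0.2, 0.8, n)          # slowly varying AR coefficient
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi[t] * y[t - 1] + rng.normal()

def one_step_forecast(series, window=100):
    """Fit AR(1) by least squares on the trailing window only."""
    w = series[-window:]
    a, b = w[:-1], w[1:]
    phi_hat = (a @ b) / (a @ a)         # OLS slope through the origin
    return phi_hat * series[-1]

# Walk forward over the last 200 points and measure prediction error.
errs = [y[t] - one_step_forecast(y[:t]) for t in range(n - 200, n)]
print(np.mean(np.square(errs)))         # out-of-sample MSE of the local fit
```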
Surprisingly, there is little literature on point prediction for general locally stationary time series even in model-based setups, and there is no literature whatsoever on the construction of prediction intervals of locally stationary time series. We attempt to fill this gap here as well. Both one-step-ahead point predictors and prediction intervals are constructed, and the performance of model-free prediction is compared to that of model-based prediction using models that incorporate a trend and/or heteroscedasticity. Both aspects of the article, model-free and model-based, are novel in the context of time series that are locally (but not globally) stationary. We also demonstrate the application of our model-based and model-free prediction methods to speleothem climate data, which exhibit local stationarity, and show that our best model-free point prediction results outperform those obtained with the RAMPFIT algorithm previously used for analysis of this type of data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 919-934 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2019.1708368 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1708368 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:919-934 Template-Type: ReDIF-Article 1.0 Author-Name: Francesca Tang Author-X-Name-First: Francesca Author-X-Name-Last: Tang Author-Name: Yang Feng Author-X-Name-First: Yang Author-X-Name-Last: Feng Author-Name: Hamza Chiheb Author-X-Name-First: Hamza Author-X-Name-Last: Chiheb Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Title: The Interplay of Demographic Variables and Social Distancing Scores in Deep Prediction of U.S. COVID-19 Cases Abstract: Given the severity of the COVID-19 outbreak, we characterize the nature of the growth trajectories of counties in the United States using a novel combination of spectral clustering and the correlation matrix. As the United States and the rest of the world are still suffering from the effects of the virus, the importance of assigning growth membership to counties and understanding the determinants of the growth is increasingly evident. For the two communities (faster versus slower growth trajectories) we cluster the counties into, the average between-group correlation is 88.4% whereas the average within-group correlations are 95.0% and 93.8%. The average growth rate is 0.1589 for one group and 0.1704 for the other, further suggesting that our methodology captures meaningful differences between the nature of the growth across various counties. Subsequently, we select the demographic features that are most statistically significant in distinguishing the communities: number of grocery stores, number of bars, Asian population, White population, median household income, number of people with bachelor’s degrees, and population density. Lastly, we effectively predict the future growth of a given county with a long short-term memory (LSTM) recurrent neural network using three social distancing scores. The best-performing model achieves a median out-of-sample R² of 0.6251 for four-day-ahead prediction, and we find that the number of communities and the social distancing features play an important role in producing more accurate forecasts.
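The “spectral clustering plus correlation matrix” combination in the Tang, Feng, Chiheb, and Fan abstract can be sketched as follows; the synthetic trajectories and the use of scikit-learn's precomputed-affinity spectral clustering are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(2)

# Synthetic county-level growth trajectories: two latent regimes that each
# share a group-common shock, mimicking faster/slower growth communities.
days = np.arange(60)
shocks = rng.standard_normal((2, 60))
rows = []
for g, slope in enumerate((0.16, 0.17)):
    for _ in range(30):
        rows.append(slope * days + 2.0 * shocks[g]
                    + 0.5 * rng.standard_normal(60))
traj = np.array(rows)

# Affinity = pairwise correlations of trajectories, clipped to [0, 1].
affinity = np.clip(np.corrcoef(traj), 0.0, 1.0)

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
print(labels[:30], labels[30:])  # the two regimes should separate
```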
This comprehensive study captures the nature of the counties’ growth in cases at a very micro-level using growth communities, demographic factors, and social distancing performance, to help government agencies use known information to decide which counties to target with resources and funding. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 492-506 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2021.1901717 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1901717 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:492-506 Template-Type: ReDIF-Article 1.0 Author-Name: Sijia Li Author-X-Name-First: Sijia Author-X-Name-Last: Li Author-Name: Xiudi Li Author-X-Name-First: Xiudi Author-X-Name-Last: Li Author-Name: Alex Luedtke Author-X-Name-First: Alex Author-X-Name-Last: Luedtke Title: Discussion of Kallus (2020) and Mo, Qi, and Liu (2020): New Objectives for Policy Learning Journal: Journal of the American Statistical Association Pages: 680-689 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1837140 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1837140 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:680-689 Template-Type: ReDIF-Article 1.0 Author-Name: Yingda Jiang Author-X-Name-First: Yingda Author-X-Name-Last: Jiang Author-Name: Chi-Yang Chiu Author-X-Name-First: Chi-Yang Author-X-Name-Last: Chiu Author-Name: Qi Yan Author-X-Name-First: Qi Author-X-Name-Last: Yan Author-Name: Wei Chen Author-X-Name-First: Wei Author-X-Name-Last: Chen Author-Name: Michael B. Gorin Author-X-Name-First: Michael B. Author-X-Name-Last: Gorin Author-Name: Yvette P. Conley Author-X-Name-First: Yvette P. Author-X-Name-Last: Conley Author-Name: M’Hamed Lajmi Lakhal-Chaieb Author-X-Name-First: M’Hamed Lajmi Author-X-Name-Last: Lakhal-Chaieb Author-Name: Richard J. Cook Author-X-Name-First: Richard J. Author-X-Name-Last: Cook Author-Name: Christopher I. Amos Author-X-Name-First: Christopher I. Author-X-Name-Last: Amos Author-Name: Alexander F. Wilson Author-X-Name-First: Alexander F. Author-X-Name-Last: Wilson Author-Name: Joan E. Bailey-Wilson Author-X-Name-First: Joan E. Author-X-Name-Last: Bailey-Wilson Author-Name: Francis J. McMahon Author-X-Name-First: Francis J. Author-X-Name-Last: McMahon Author-Name: Ana I. Vazquez Author-X-Name-First: Ana I. Author-X-Name-Last: Vazquez Author-Name: Ao Yuan Author-X-Name-First: Ao Author-X-Name-Last: Yuan Author-Name: Xiaogang Zhong Author-X-Name-First: Xiaogang Author-X-Name-Last: Zhong Author-Name: Momiao Xiong Author-X-Name-First: Momiao Author-X-Name-Last: Xiong Author-Name: Daniel E. Weeks Author-X-Name-First: Daniel E. Author-X-Name-Last: Weeks Author-Name: Ruzong Fan Author-X-Name-First: Ruzong Author-X-Name-Last: Fan Title: Gene-Based Association Testing of Dichotomous Traits With Generalized Functional Linear Mixed Models Using Extended Pedigrees: Applications to Age-Related Macular Degeneration Abstract: Genetics plays a role in age-related macular degeneration (AMD), a common cause of blindness in the elderly.
There is a need for powerful methods for carrying out region-based association tests between a dichotomous trait like AMD and genetic variants on family data. Here, we apply our newly developed generalized functional linear mixed models (GFLMM) to test for gene-based association in a set of AMD families. Using common and rare variants, we observe significant association with two known AMD genes: CFH and ARMS2. Using rare variants, we find suggestive signals in four genes: ASAH1, CLEC6A, TMEM63C, and SGSM1. Intriguingly, ASAH1 is down-regulated in AMD aqueous humor, and ASAH1 deficiency leads to retinal inflammation and increased vulnerability to oxidative stress. These findings were made possible by our GFLMM, which models the effect of a major gene as a fixed mean, the polygenic contributions as random variation, and the correlation of pedigree members via kinship coefficients. Simulations indicate that the GFLMM likelihood ratio tests (LRTs) accurately control the Type I error rates. The LRTs have similar or higher power than existing retrospective kernel and burden statistics. Our GFLMM-based statistics provide a new tool for conducting family-based genetic studies of complex diseases. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 531-545 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1799809 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1799809 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:531-545 Template-Type: ReDIF-Article 1.0 Author-Name: Stijn Vansteelandt Author-X-Name-First: Stijn Author-X-Name-Last: Vansteelandt Author-Name: Oliver Dukes Author-X-Name-First: Oliver Author-X-Name-Last: Dukes Title: Discussion of Kallus and Mo, Qi, and Liu: New Objectives for Policy Learning Journal: Journal of the American Statistical Association Pages: 675-679 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1844718 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1844718 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:675-679 Template-Type: ReDIF-Article 1.0 Author-Name: Ian L. Dryden Author-X-Name-First: Ian L. Author-X-Name-Last: Dryden Author-Name: Alfred Kume Author-X-Name-First: Alfred Author-X-Name-Last: Kume Author-Name: Phillip J. Paine Author-X-Name-First: Phillip J. Author-X-Name-Last: Paine Author-Name: Andrew T. A. Wood Author-X-Name-First: Andrew T. A. Author-X-Name-Last: Wood Title: Regression Modeling for Size-and-Shape Data Based on a Gaussian Model for Landmarks Abstract: In this article, we propose a regression model for size-and-shape response data. So far as we are aware, few such models have been explored in the literature to date. We assume a Gaussian model for labeled landmarks; these landmarks are used to represent the random objects under study. The regression structure, assumed in this article to be linear in the ambient space, enters through the landmark means. Two approaches to parameter estimation are considered. The first approach is based directly on the marginal likelihood for the landmark-based shapes.
In the second approach, we treat the orientations of the landmarks as missing data, and we set up a model-consistent estimation procedure for the parameters using the EM algorithm. Both approaches raise challenging computational issues, which we explain how to address. The usefulness of this regression modeling framework is demonstrated through real-data examples. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1011-1022 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1724115 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1724115 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:1011-1022 Template-Type: ReDIF-Article 1.0 Author-Name: Rong Ma Author-X-Name-First: Rong Author-X-Name-Last: Ma Author-Name: T. Tony Cai Author-X-Name-First: T. Tony Author-X-Name-Last: Cai Author-Name: Hongzhe Li Author-X-Name-First: Hongzhe Author-X-Name-Last: Li Title: Global and Simultaneous Hypothesis Testing for High-Dimensional Logistic Regression Models Abstract: High-dimensional logistic regression is widely used in analyzing data with binary outcomes. In this article, global testing and large-scale multiple testing for the regression coefficients are considered in both single- and two-regression settings. A test statistic for testing the global null hypothesis is constructed using a generalized low-dimensional projection for bias correction and its asymptotic null distribution is derived. A lower bound for the global testing is established, which shows that the proposed test is asymptotically minimax optimal over some sparsity range. For testing the individual coefficients simultaneously, multiple testing procedures are proposed and shown to control the false discovery rate and falsely discovered variables asymptotically. Simulation studies are carried out to examine the numerical performance of the proposed tests and their superiority over existing methods. The testing procedures are also illustrated by analyzing a dataset of a metabolomics study that investigates the association between fecal metabolites and pediatric Crohn’s disease and the effects of treatment on such associations. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 984-998 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2019.1699421 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1699421 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:984-998 Template-Type: ReDIF-Article 1.0 Author-Name: Yifei Sun Author-X-Name-First: Yifei Author-X-Name-Last: Sun Author-Name: Charles E. McCulloch Author-X-Name-First: Charles E. Author-X-Name-Last: McCulloch Author-Name: Kieren A. Marr Author-X-Name-First: Kieren A. Author-X-Name-Last: Marr Author-Name: Chiung-Yu Huang Author-X-Name-First: Chiung-Yu Author-X-Name-Last: Huang Title: Recurrent Events Analysis With Data Collected at Informative Clinical Visits in Electronic Health Records Abstract: Although increasingly used as a data resource for assembling cohorts, electronic health records (EHRs) pose many analytic challenges. In particular, a patient’s health status influences when and what data are recorded, generating sampling bias in the collected data.
In this article, we consider recurrent event analysis using EHR data. Conventional regression methods for event risk analysis usually require the values of covariates to be observed throughout the follow-up period. In EHR databases, time-dependent covariates are intermittently measured during clinical visits, and the timing of these visits is informative in the sense that it depends on the disease course. Simple methods, such as the last-observation-carried-forward approach, can lead to biased estimation. On the other hand, complex joint models require additional assumptions on the covariate process and cannot be easily extended to handle multiple longitudinal predictors. By incorporating sampling weights derived from estimating the observation time process, we develop a novel estimation procedure based on inverse-rate-weighting and kernel-smoothing for the semiparametric proportional rate model of recurrent events. The proposed methods do not require model specifications for the covariate processes and can easily handle multiple time-dependent covariates. Our methods are applied to a kidney transplant study for illustration. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 594-604 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1801447 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1801447 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:594-604 Template-Type: ReDIF-Article 1.0 Author-Name: Min Jin Ha Author-X-Name-First: Min Jin Author-X-Name-Last: Ha Author-Name: Francesco Claudio Stingo Author-X-Name-First: Francesco Claudio Author-X-Name-Last: Stingo Author-Name: Veerabhadran Baladandayuthapani Author-X-Name-First: Veerabhadran Author-X-Name-Last: Baladandayuthapani Title: Bayesian Structure Learning in Multilayered Genomic Networks Abstract: Integrative network modeling of data arising from multiple genomic platforms provides insight into the holistic picture of the interactive system, as well as the flow of information across many disease domains including cancer. The basic data structure consists of a sequence of hierarchically ordered datasets for each individual subject, which facilitates integration of diverse inputs, such as genomic, transcriptomic, and proteomic data. A primary analytical task in such contexts is to model the layered architecture of networks where the vertices can be naturally partitioned into ordered layers, dictated by multiple platforms, and exhibit both undirected and directed relationships. We propose a multilayered Gaussian graphical model (mlGGM) to investigate conditional independence structures in such multilevel genomic networks in human cancers. We implement a Bayesian node-wise selection (BANS) approach based on variable selection techniques that coherently accounts for the multiple types of dependencies in mlGGM; this flexible strategy exploits edge-specific prior knowledge and selects sparse and interpretable models. Through simulated data generated under various scenarios, we demonstrate that BANS outperforms other existing multivariate regression-based methodologies. 
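The node-wise selection idea in the Ha, Stingo, and Baladandayuthapani abstract builds on a standard identity for Gaussian graphical models, sketched below in generic notation; the mlGGM/BANS machinery adds layer ordering, directed edges, and edge-specific priors on top of this basic representation.

```latex
% For a zero-mean Gaussian vector X with precision matrix \Omega, each
% node admits the regression representation
\[
  X_v \;=\; \sum_{u \neq v} \beta_{vu} X_u \;+\; \varepsilon_v,
  \qquad
  \beta_{vu} \;=\; -\,\Omega_{vu} / \Omega_{vv},
\]
% so selecting the nonzero coefficients in each node's regression
% recovers the conditional-independence (edge) structure of the graph.
```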
Our integrative genomic network analysis for key signaling pathways across multiple cancer types highlights commonalities and differences of p53 integrative networks, as well as epigenetic effects of BRCA2 on p53 and its interaction with T68-phosphorylated CHK2, which may have translational utility for finding biomarkers and therapeutic targets. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 605-618 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1775611 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1775611 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:605-618 Template-Type: ReDIF-Article 1.0 Author-Name: Xiaohan Yan Author-X-Name-First: Xiaohan Author-X-Name-Last: Yan Author-Name: Jacob Bien Author-X-Name-First: Jacob Author-X-Name-Last: Bien Title: Rare Feature Selection in High Dimensions Abstract: It is common in modern prediction problems for many predictor variables to be counts of rarely occurring events. This leads to design matrices in which many columns are highly sparse. The challenge posed by such “rare features” has received little attention despite its prevalence in diverse areas, ranging from natural language processing (e.g., rare words) to biology (e.g., rare species). We show, both theoretically and empirically, that not explicitly accounting for the rareness of features can greatly reduce the effectiveness of an analysis. We next propose a framework for aggregating rare features into denser features in a flexible manner that creates better predictors of the response. Our strategy leverages side information in the form of a tree that encodes feature similarity. We apply our method to data from TripAdvisor, in which we predict the numerical rating of a hotel based on the text of the associated review. Our method achieves high accuracy by making effective use of rare words; by contrast, the lasso is unable to identify highly predictive words if they are too rare. A companion R package, called rare, implements our new estimator, using the alternating direction method of multipliers. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 887-900 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1796677 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1796677 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:887-900 Template-Type: ReDIF-Article 1.0 Author-Name: Eric J. Tchetgen Tchetgen Author-X-Name-First: Eric J. Author-X-Name-Last: Tchetgen Tchetgen Author-Name: Isabel R. Fulcher Author-X-Name-First: Isabel R. Author-X-Name-Last: Fulcher Author-Name: Ilya Shpitser Author-X-Name-First: Ilya Author-X-Name-Last: Shpitser Title: Auto-G-Computation of Causal Effects on a Network Abstract: Methods for inferring average causal effects have traditionally relied on two key assumptions: (i) the intervention received by one unit cannot causally influence the outcome of another; and (ii) units can be organized into nonoverlapping groups such that outcomes of units in separate groups are independent.
In this article, we develop new statistical methods for causal inference based on a single realization of a network of connected units for which neither assumption (i) nor (ii) holds. The proposed approach allows both for arbitrary forms of interference, whereby the outcome of a unit may depend on interventions received by other units with which a network path through connected units exists; and long-range dependence, whereby outcomes for any two units likewise connected by a path in the network may be dependent. Under network versions of consistency and no unobserved confounding, inference is made tractable by an assumption that the network’s outcome, treatment, and covariate vectors are a single realization of a certain chain graph model. This assumption allows inferences about various network causal effects via the auto-g-computation algorithm, a network generalization of Robins’ well-known g-computation algorithm previously described for causal inference under assumptions (i) and (ii). Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 833-844 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1811098 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1811098 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:833-844 Template-Type: ReDIF-Article 1.0 Author-Name: Cheng Zhang Author-X-Name-First: Cheng Author-X-Name-Last: Zhang Author-Name: Vu Dinh Author-X-Name-First: Vu Author-X-Name-Last: Dinh Author-Name: Frederick A. Matsen Author-X-Name-First: Frederick A. Author-X-Name-Last: Matsen Title: Nonbifurcating Phylogenetic Tree Inference via the Adaptive LASSO Abstract: Phylogenetic tree inference using deep DNA sequencing is reshaping our understanding of rapidly evolving systems, such as the within-host battle between viruses and the immune system. Densely sampled phylogenetic trees can contain special features, including sampled ancestors in which we sequence a genotype along with its direct descendants, and polytomies in which multiple descendants arise simultaneously. These features are apparent after identifying zero-length branches in the tree. However, current maximum-likelihood-based approaches are not capable of revealing such zero-length branches. In this article, we find these zero-length branches by introducing adaptive-LASSO-type regularization estimators for the branch lengths of phylogenetic trees, deriving their properties, and showing regularization to be a practically useful approach for phylogenetics. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 858-873 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1778481 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1778481 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:858-873 Template-Type: ReDIF-Article 1.0 Author-Name: Kwonsang Lee Author-X-Name-First: Kwonsang Author-X-Name-Last: Lee Author-Name: Dylan S. Small Author-X-Name-First: Dylan S.
Author-X-Name-Last: Small Author-Name: Francesca Dominici Author-X-Name-First: Francesca Author-X-Name-Last: Dominici Title: Discovering Heterogeneous Exposure Effects Using Randomization Inference in Air Pollution Studies Abstract: Several studies have provided strong evidence that long-term exposure to air pollution, even at low levels, increases risk of mortality. As regulatory actions are becoming prohibitively expensive, robust evidence to guide the development of targeted interventions to protect the most vulnerable is needed. In this article, we introduce a novel statistical method that (i) discovers subgroups whose effects substantially differ from the population mean, and (ii) uses randomization-based tests to assess discovered heterogeneous effects. Also, we develop a sensitivity analysis method to assess the robustness of the conclusions to unmeasured confounding bias. Via simulation studies and theoretical arguments, we demonstrate that hypothesis testing focusing on the discovered subgroups can substantially increase statistical power to detect heterogeneity of the exposure effects. We apply the proposed de novo method to the data of 1,612,414 Medicare beneficiaries in the New England region in the United States for the period 2000–2006. We find that seniors aged between 81 and 85 with low income and seniors aged 85 and above have statistically significantly greater causal effects of long-term exposure to PM2.5 on the 5-year mortality rate compared to the population mean. Journal: Journal of the American Statistical Association Pages: 569-580 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1870476 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1870476 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:569-580 Template-Type: ReDIF-Article 1.0 Author-Name: Laura Forastiere Author-X-Name-First: Laura Author-X-Name-Last: Forastiere Author-Name: Edoardo M. Airoldi Author-X-Name-First: Edoardo M. Author-X-Name-Last: Airoldi Author-Name: Fabrizia Mealli Author-X-Name-First: Fabrizia Author-X-Name-Last: Mealli Title: Identification and Estimation of Treatment and Interference Effects in Observational Studies on Networks Abstract: Causal inference on a population of units connected through a network often presents technical challenges, including how to account for interference. In the presence of interference, for instance, potential outcomes of a unit depend on its treatment as well as on the treatments of other units, such as its neighbors in the network. In observational studies, a further complication is that the typical unconfoundedness assumption must be extended, say, to include the treatment of neighbors and individual and neighborhood covariates, to guarantee identification and valid inference. Here, we propose new estimands that define treatment and interference effects. We then derive analytical expressions for the bias of a naive estimator that wrongly assumes away interference. The bias depends on the level of interference but also on the degree of association between individual and neighborhood treatments. We propose an extended unconfoundedness assumption that accounts for interference, and we develop new covariate-adjustment methods that lead to valid estimates of treatment and interference effects in observational studies on networks.
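The neighborhood-interference setup described in the Forastiere, Airoldi, and Mealli abstract is often formalized with potential outcomes indexed by both the unit's own treatment and a summary of its neighbors' treatments; the notation below is a common illustrative convention rather than the article's own.

```latex
% Potential outcome of unit i under own treatment z_i and a summary
% g_i = g(Z_{N_i}) of its neighbors' treatments (illustrative notation):
\[
  Y_i\!\left(z_i,\, g_i\right),
  \qquad
  \tau(z, z'; g) \;=\; \mathbb{E}\!\left[ Y_i(z, g) - Y_i(z', g) \right],
\]
% so treatment effects contrast own-treatment levels at fixed neighborhood
% exposure g, while interference effects contrast values of g at fixed z.
```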
Estimation is based on a generalized propensity score that balances individual and neighborhood covariates across units under different levels of individual treatment and of exposure to neighbors’ treatment. We carry out simulations, calibrated using friendship networks and covariates in a nationally representative longitudinal study of adolescents in grades 7–12 in the United States, to explore finite-sample performance in different realistic settings. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 901-918 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1768100 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1768100 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:901-918 Template-Type: ReDIF-Article 1.0 Author-Name: Zhicheng Ji Author-X-Name-First: Zhicheng Author-X-Name-Last: Ji Author-Name: Hongkai Ji Author-X-Name-First: Hongkai Author-X-Name-Last: Ji Title: Discussion of “Exponential-Family Embedding With Application to Cell Developmental Trajectories for Single-Cell RNA-seq Data” Journal: Journal of the American Statistical Association Pages: 471-474 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2021.1880920 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1880920 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:471-474 Template-Type: ReDIF-Article 1.0 Author-Name: Danijel Kivaranovic Author-X-Name-First: Danijel Author-X-Name-Last: Kivaranovic Author-Name: Hannes Leeb Author-X-Name-First: Hannes Author-X-Name-Last: Leeb Title: On the Length of Post-Model-Selection Confidence Intervals Conditional on Polyhedral Constraints Abstract: Valid inference after model selection is currently a very active area of research. The polyhedral method, introduced in an article by Lee et al., allows for valid inference after model selection if the model selection event can be described by polyhedral constraints. In that reference, the method is exemplified by constructing two valid confidence intervals when the Lasso estimator is used to select a model. Here we study the length of these intervals. For one of these confidence intervals, which is easier to compute, we find that its expected length is always infinite. For the other of these confidence intervals, whose computation is more demanding, we give a necessary and sufficient condition for its expected length to be infinite. In simulations, we find that this condition is typically satisfied, unless the selected model includes almost all or almost none of the available regressors. For the distribution of confidence interval length, we find that the κ-quantiles behave like 1/(1−κ) for κ close to 1. Our results can also be used to analyze other confidence intervals that are based on the polyhedral method. Journal: Journal of the American Statistical Association Pages: 845-857 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1732989 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1732989 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:845-857 Template-Type: ReDIF-Article 1.0 Author-Name: Muxuan Liang Author-X-Name-First: Muxuan Author-X-Name-Last: Liang Author-Name: Ying-Qi Zhao Author-X-Name-First: Ying-Qi Author-X-Name-Last: Zhao Title: Discussion of Kallus (2020) and Mo et al. (2020) Abstract: We discuss the results on improving the generalizability of individualized treatment rules following the work of Kallus and of Mo et al. We note that the advocated weights in the work of Kallus are connected to the efficient score of the contrast function. We further propose a likelihood-ratio-based method (LR-ITR) to accommodate covariate shifts, and compare it to the CTE-DR-ITR method proposed by Mo et al. We provide an upper bound on the risk function for the target population when both the covariate shift and the contrast function shift are present. Numerical studies show that LR-ITR can outperform CTE-DR-ITR when there is only covariate shift. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 690-693 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1833887 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1833887 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:690-693 Template-Type: ReDIF-Article 1.0 Author-Name: Ting Tian Author-X-Name-First: Ting Author-X-Name-Last: Tian Author-Name: Jianbin Tan Author-X-Name-First: Jianbin Author-X-Name-Last: Tan Author-Name: Wenxiang Luo Author-X-Name-First: Wenxiang Author-X-Name-Last: Luo Author-Name: Yukang Jiang Author-X-Name-First: Yukang Author-X-Name-Last: Jiang Author-Name: Minqiong Chen Author-X-Name-First: Minqiong Author-X-Name-Last: Chen Author-Name: Songpan Yang Author-X-Name-First: Songpan Author-X-Name-Last: Yang Author-Name: Canhong Wen Author-X-Name-First: Canhong Author-X-Name-Last: Wen Author-Name: Wenliang Pan Author-X-Name-First: Wenliang Author-X-Name-Last: Pan Author-Name: Xueqin Wang Author-X-Name-First: Xueqin Author-X-Name-Last: Wang Title: The Effects of Stringent and Mild Interventions for Coronavirus Pandemic Abstract: The COVID-19 pandemic has caused severe public health consequences around the world, and many interventions have been implemented in response. It is of great public health and social importance to evaluate their effects. With the help of a synthetic control method, regression discontinuity, and a state-space compartmental model, we evaluated the treatment and stagewise effects of the intervention policies. We found statistically significant treatment effects of the broad stringent interventions in Wenzhou and the mild interventions in Shanghai in subduing the epidemic’s spread. Had those reduction effects not been activated, the expected number of positive individuals would have increased 2.18-fold by February 5, 2020, in Wenzhou and 7.69-fold by February 4, 2020, in Shanghai. In addition, the regression discontinuity analysis identified that both the stringent (p-value < 0.001) and the mild (p-value = 0.024) interventions lowered the severity of the epidemic. Compartmental modeling of the different interventions further clarified the importance of implementing them: the highest-level alert to COVID-19 was practical and crucial at the early stage of the epidemic.
Furthermore, the physical/social distancing policy was necessary once the spread of COVID-19 continued. If appropriate control measures were implemented, the epidemic could be brought under control effectively and early. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 481-491 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2021.1897015 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1897015 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:481-491 Template-Type: ReDIF-Article 1.0 Author-Name: Dungang Liu Author-X-Name-First: Dungang Author-X-Name-Last: Liu Author-Name: Shaobo Li Author-X-Name-First: Shaobo Author-X-Name-Last: Li Author-Name: Yan Yu Author-X-Name-First: Yan Author-X-Name-Last: Yu Author-Name: Irini Moustaki Author-X-Name-First: Irini Author-X-Name-Last: Moustaki Title: Assessing Partial Association Between Ordinal Variables: Quantification, Visualization, and Hypothesis Testing Abstract: Partial association refers to the relationship between variables Y1,Y2,…,YK while adjusting for a set of covariates X={X1,…,Xp}. To assess such an association when the Yk’s are recorded on ordinal scales, a classical approach is to use partial correlation between the latent continuous variables. This so-called polychoric correlation is inadequate, as it requires multivariate normality and it only reflects a linear association. We propose a new framework for studying ordinal-ordinal partial association by using Liu-Zhang’s surrogate residuals. We justify that, conditional on X, Yk and Yl are independent if and only if their corresponding surrogate residual variables are independent. Based on this result, we develop a general measure ϕ to quantify association strength. As opposed to polychoric correlation, ϕ does not rely on normality or models with the probit link, but instead it broadly applies to models with any link function. It can capture a nonlinear or even nonmonotonic association. Moreover, the measure ϕ gives rise to a general procedure for testing the hypothesis of partial independence. Our framework also permits visualization tools, such as partial regression plots and three-dimensional P-P plots, to examine the association structure, which is otherwise infeasible for ordinal data. We stress that the whole set of tools (measures, p-values, and graphics) is developed within a single unified framework, which allows coherent inference. The analyses of the National Election Study (K = 5) and Big Five Personality Traits (K = 50) demonstrate that our framework leads to a much fuller assessment of partial association and yields deeper insights for domain researchers. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 955-968 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1796394 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1796394 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:955-968 Template-Type: ReDIF-Article 1.0 Author-Name: Karthika Mohan Author-X-Name-First: Karthika Author-X-Name-Last: Mohan Author-Name: Judea Pearl Author-X-Name-First: Judea Author-X-Name-Last: Pearl Title: Graphical Models for Processing Missing Data Abstract: This article reviews recent advances in missing data research using graphical models to represent multivariate dependencies. We first examine the limitations of traditional frameworks from three different perspectives: transparency, estimability, and testability. We then show how procedures based on graphical models can overcome these limitations and provide meaningful performance guarantees even when data are missing not at random (MNAR). In particular, we identify conditions that guarantee consistent estimation in broad categories of missing data problems, and derive procedures for implementing this estimation. Finally, we derive testable implications for missing data models in both missing at random and MNAR categories. Journal: Journal of the American Statistical Association Pages: 1023-1037 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2021.1874961 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1874961 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:1023-1037 Template-Type: ReDIF-Article 1.0 Author-Name: Kosuke Imai Author-X-Name-First: Kosuke Author-X-Name-Last: Imai Author-Name: Zhichao Jiang Author-X-Name-First: Zhichao Author-X-Name-Last: Jiang Author-Name: Anup Malani Author-X-Name-First: Anup Author-X-Name-Last: Malani Title: Causal Inference With Interference and Noncompliance in Two-Stage Randomized Experiments Abstract: In many social science experiments, subjects often interact with each other and, as a result, one unit’s treatment influences the outcome of another unit. Over the last decade, significant progress has been made toward causal inference in the presence of such interference between units. Researchers have shown that the two-stage randomization of treatment assignment enables the identification of average direct and spillover effects. However, much of the literature has assumed perfect compliance with treatment assignment. In this article, we establish the nonparametric identification of the complier average direct and spillover effects in two-stage randomized experiments with interference and noncompliance. In particular, we consider the spillover effect of the treatment assignment on the treatment receipt as well as the spillover effect of the treatment receipt on the outcome. We propose consistent estimators and derive their randomization-based variances under the stratified interference assumption. We also prove the exact relationships between the proposed randomization-based estimators and the popular two-stage least squares estimators. The proposed methodology is motivated by and applied to our own randomized evaluation of India’s National Health Insurance Program (RSBY), where we find some evidence of spillover effects. The proposed methods are implemented via an open-source software package. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association Pages: 632-644 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1775612 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1775612 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:632-644 Template-Type: ReDIF-Article 1.0 Author-Name: Haoyu Chen Author-X-Name-First: Haoyu Author-X-Name-Last: Chen Author-Name: Wenbin Lu Author-X-Name-First: Wenbin Author-X-Name-Last: Lu Author-Name: Rui Song Author-X-Name-First: Rui Author-X-Name-Last: Song Title: Statistical Inference for Online Decision Making via Stochastic Gradient Descent Abstract: Online decision making aims to learn the optimal decision rule by making personalized decisions and updating the decision rule recursively. It has become easier with the help of big data, but new challenges have come along as well. Since the decision rule should be updated once per step, an offline update that uses all the historical data is inefficient in both computation and storage. To this end, we propose a completely online algorithm that can make decisions and update the decision rule online via stochastic gradient descent. It is not only efficient but also supports all kinds of parametric reward models. Focusing on the statistical inference of online decision making, we establish the asymptotic normality of the parameter estimator produced by our algorithm and of the online inverse probability weighted value estimator we used to estimate the optimal value. Online plugin estimators for the variance of the parameter and value estimators are also provided and shown to be consistent, so that interval estimation and hypothesis testing are possible using our method. The proposed algorithm and theoretical results are tested by simulations and a real data application to news article recommendation. Journal: Journal of the American Statistical Association Pages: 708-719 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1826325 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1826325 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:708-719 Template-Type: ReDIF-Article 1.0 Author-Name: Ricardo Moura Author-X-Name-First: Ricardo Author-X-Name-Last: Moura Author-Name: Martin Klein Author-X-Name-First: Martin Author-X-Name-Last: Klein Author-Name: John Zylstra Author-X-Name-First: John Author-X-Name-Last: Zylstra Author-Name: Carlos A. Coelho Author-X-Name-First: Carlos A. Author-X-Name-Last: Coelho Author-Name: Bimal Sinha Author-X-Name-First: Bimal Author-X-Name-Last: Sinha Title: Inference for Multivariate Regression Model Based on Synthetic Data Generated Using Plug-in Sampling Abstract: In this article, the authors derive the likelihood-based exact inference for singly and multiply imputed synthetic data in the context of a multivariate regression model. The synthetic data are generated via the Plug-in Sampling method, where the unknown parameters in the model are set equal to the observed values of their point estimators based on the original data, and synthetic data are drawn from this estimated version of the model. Simulation studies are carried out to confirm the theoretical results. The authors provide exact test procedures, which, when multiple synthetic datasets are permissible, are compared with the asymptotic results of Reiter. An application using 2000 U.S.
Current Population Survey public use data is discussed. Furthermore, properties of the proposed methodology are evaluated in scenarios where some of the conditions used to derive the methodology do not hold, namely for nonnormal and discretely distributed random variables; in these cases, the inferential procedures developed still show very good performance. Journal: Journal of the American Statistical Association Pages: 720-733 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2021.1900860 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1900860 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:720-733 Template-Type: ReDIF-Article 1.0 Author-Name: Joshua Lukemire Author-X-Name-First: Joshua Author-X-Name-Last: Lukemire Author-Name: Suprateek Kundu Author-X-Name-First: Suprateek Author-X-Name-Last: Kundu Author-Name: Giuseppe Pagnoni Author-X-Name-First: Giuseppe Author-X-Name-Last: Pagnoni Author-Name: Ying Guo Author-X-Name-First: Ying Author-X-Name-Last: Guo Title: Bayesian Joint Modeling of Multiple Brain Functional Networks Abstract: Investigating the similarity and changes in brain networks under different mental conditions has become increasingly important in neuroscience research. A standard separate estimation strategy fails to pool information across networks and hence has reduced estimation accuracy and power to detect between-network differences. Motivated by an fMRI Stroop task experiment that involves multiple related tasks, we develop an integrative Bayesian approach for jointly modeling multiple brain networks that provides a systematic inferential framework for network comparisons. The proposed approach explicitly models shared and differential patterns via flexible Dirichlet process-based priors on edge probabilities. Conditional on edges, the connection strengths are modeled via a Bayesian spike-and-slab prior on the precision matrix off-diagonals. Numerical simulations illustrate that the proposed approach has increased power to detect true differential edges while providing adequate control on false positives and achieves greater network estimation accuracy compared to existing methods. The Stroop task data analysis reveals greater connectivity differences between task and fixation that are concentrated in brain regions previously identified as differentially activated in the Stroop task, and more nuanced connectivity differences between the exertion and relaxed tasks. In contrast, penalized modeling approaches involving computationally burdensome permutation tests reveal negligible network differences between conditions, which seems biologically implausible. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 518-530 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1796357 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1796357 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:518-530 Template-Type: ReDIF-Article 1.0 Author-Name: Jordan Awan Author-X-Name-First: Jordan Author-X-Name-Last: Awan Author-Name: Aleksandra Slavković Author-X-Name-First: Aleksandra Author-X-Name-Last: Slavković Title: Structure and Sensitivity in Differential Privacy: Comparing K-Norm Mechanisms Abstract: Differential privacy (DP) provides a framework for provable privacy protection against arbitrary adversaries, while allowing the release of summary statistics and synthetic data. We address the problem of releasing a noisy real-valued statistic vector T, a function of sensitive data under DP, via the class of K-norm mechanisms with the goal of minimizing the noise added to achieve privacy. First, we introduce the sensitivity space of T, which extends the concepts of sensitivity polytope and sensitivity hull to the setting of arbitrary statistics T. We then propose a framework consisting of three methods for comparing the K-norm mechanisms: (1) a multivariate extension of stochastic dominance, (2) the entropy of the mechanism, and (3) the conditional variance given a direction, to identify the optimal K-norm mechanism. In all of these criteria, the optimal K-norm mechanism is generated by the convex hull of the sensitivity space. Using our methodology, we extend the objective perturbation and functional mechanisms and apply these tools to logistic and linear regression, allowing for private releases of statistical results. Via simulations and an application to a housing price dataset, we demonstrate that our proposed methodology offers a substantial improvement in utility for the same level of risk. Journal: Journal of the American Statistical Association Pages: 935-954 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1773831 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1773831 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:935-954 Template-Type: ReDIF-Article 1.0 Author-Name: Chi Wing Chu Author-X-Name-First: Chi Wing Author-X-Name-Last: Chu Author-Name: Tony Sit Author-X-Name-First: Tony Author-X-Name-Last: Sit Author-Name: Gongjun Xu Author-X-Name-First: Gongjun Author-X-Name-Last: Xu Title: Transformed Dynamic Quantile Regression on Censored Data Abstract: We propose a class of power-transformed linear quantile regression models for time-to-event observations subject to censoring. By introducing a process of power transformation with different transformation parameters at individual quantile levels, our framework relaxes the assumption of logarithmic transformation on survival times and provides dynamic estimation of various quantile levels. With such a formulation, our proposal no longer requires the potentially restrictive global linearity assumption imposed by a class of existing inference procedures for censored quantile regression. Uniform consistency and weak convergence of the proposed estimator as a process of quantile levels are established via a martingale-based argument. Numerical studies are presented to illustrate that the proposed estimator outperforms existing contenders under various settings.
Journal: Journal of the American Statistical Association Pages: 874-886 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2019.1695623 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1695623 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:874-886 Template-Type: ReDIF-Article 1.0 Author-Name: Kiranmoy Das Author-X-Name-First: Kiranmoy Author-X-Name-Last: Das Author-Name: Pulak Ghosh Author-X-Name-First: Pulak Author-X-Name-Last: Ghosh Author-Name: Michael J. Daniels Author-X-Name-First: Michael J. Author-X-Name-Last: Daniels Title: Modeling Multiple Time-Varying Related Groups: A Dynamic Hierarchical Bayesian Approach With an Application to the Health and Retirement Study Abstract: As the population of older individuals continues to grow, it is important to study the relationship among the variables measuring their financial and physical health to better understand the demand for healthcare and health insurance. We propose a semiparametric approach to jointly model these variables. We use data from the Health and Retirement Study, which includes a set of correlated longitudinal variables measuring financial and physical health. In particular, we propose a dynamic hierarchical matrix stick-breaking process prior for some of the model parameters to account for the time-dependent aspects of our data. This prior introduces dependence among the parameters across different groups that varies over time. A Lasso-type shrinkage prior is specified for the covariates with time-invariant effects to select the set of covariates with significant effects on the outcomes. Through joint modeling, we are able to study the physical health of older individuals conditional on their financial health, and vice versa. Based on our analysis, we find that the health insurance (Medicare) provided by the U.S. government to older individuals is very effective and covers most medical expenditures. However, none of the health insurance plans conveniently covers the additional medical expenses due to chronic diseases like cancer and heart problems. Simulation studies are performed to assess the operating characteristics of our proposed modeling approach. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 558-568 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2021.1886105 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1886105 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:558-568 Template-Type: ReDIF-Article 1.0 Author-Name: Maxime Rischard Author-X-Name-First: Maxime Author-X-Name-Last: Rischard Author-Name: Zach Branson Author-X-Name-First: Zach Author-X-Name-Last: Branson Author-Name: Luke Miratrix Author-X-Name-First: Luke Author-X-Name-Last: Miratrix Author-Name: Luke Bornn Author-X-Name-First: Luke Author-X-Name-Last: Bornn Title: Do School Districts Affect NYC House Prices? Identifying Border Differences Using a Bayesian Nonparametric Approach to Geographic Regression Discontinuity Designs Abstract: What is the premium on house price for a particular school district?
To estimate this in New York City, we use a novel implementation of a geographic regression discontinuity design (GeoRDD) built from Gaussian process regression (kriging) to model spatial structure. With a GeoRDD, we specifically examine price differences along borders between “treatment” and “control” school districts. GeoRDDs extend RDDs to multivariate settings; location is the forcing variable and the border between school districts constitutes the discontinuity threshold. We first obtain a Bayesian posterior distribution of the price difference function, our nominal treatment effect, along the border. We then address nuances of having a functional estimand defined on a border with potentially intricate topology, particularly when defining and estimating causal estimands of the local average treatment effect (LATE). We test for a nonzero LATE with a calibrated hypothesis test with good frequentist properties, which we further validate using a placebo test. Using our methodology, we identify substantial differences in price across several borders. In one case, a border separating Brooklyn and Queens, we estimate a statistically significant 20% higher price for a house on the more desirable side. We also find that geographic features can undermine some of these comparisons. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 619-631 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1817749 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1817749 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:619-631 Template-Type: ReDIF-Article 1.0 Author-Name: Sharmistha Guha Author-X-Name-First: Sharmistha Author-X-Name-Last: Guha Author-Name: Abel Rodriguez Author-X-Name-First: Abel Author-X-Name-Last: Rodriguez Title: Bayesian Regression With Undirected Network Predictors With an Application to Brain Connectome Data Abstract: This article focuses on the relationship between a measure of creativity and the human brain network for subjects in a brain connectome dataset obtained using a diffusion weighted magnetic resonance imaging procedure. We identify brain regions and interconnections that have a significant effect on creativity. Brain networks are often expressed in terms of symmetric adjacency matrices, with row and column indices of the matrix representing the regions of interest (ROI), and a cell entry signifying the estimated number of fiber bundles connecting the corresponding row and column ROIs. Current statistical practices for regression analysis with the brain network as the predictor and the measure of creativity as the response typically vectorize the network predictor matrices prior to any analysis, thus failing to account for the important structural information in the network. This results in poor inferential and predictive performance in the presence of small sample sizes. To answer the scientific questions discussed above, we develop a flexible Bayesian framework that avoids reshaping the network predictor matrix, draws inference on brain ROIs and interconnections significantly related to creativity, and enables accurate prediction of creativity from a brain network.
A novel class of network shrinkage priors for the coefficient corresponding to the network predictor is proposed to achieve these goals simultaneously. The Bayesian framework allows characterization of uncertainty in the findings. Empirical results in simulation studies illustrate substantial inferential and predictive gains of the proposed framework in comparison with ordinary high-dimensional Bayesian shrinkage priors and penalized optimization schemes. Our framework yields new insights into the relationship of brain regions with creativity, while also providing the uncertainty associated with the scientific findings. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 581-593 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1772079 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1772079 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:581-593 Template-Type: ReDIF-Article 1.0 Author-Name: Weibin Mo Author-X-Name-First: Weibin Author-X-Name-Last: Mo Author-Name: Zhengling Qi Author-X-Name-First: Zhengling Author-X-Name-Last: Qi Author-Name: Yufeng Liu Author-X-Name-First: Yufeng Author-X-Name-Last: Liu Title: Rejoinder: Learning Optimal Distributionally Robust Individualized Treatment Rules Journal: Journal of the American Statistical Association Pages: 699-707 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1866581 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1866581 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:699-707 Template-Type: ReDIF-Article 1.0 Author-Name: Kevin Z. Lin Author-X-Name-First: Kevin Z. Author-X-Name-Last: Lin Author-Name: Jing Lei Author-X-Name-First: Jing Author-X-Name-Last: Lei Author-Name: Kathryn Roeder Author-X-Name-First: Kathryn Author-X-Name-Last: Roeder Title: Exponential-Family Embedding With Application to Cell Developmental Trajectories for Single-Cell RNA-Seq Data Abstract: Scientists often embed cells into a lower-dimensional space when studying single-cell RNA-seq data for improved downstream analyses such as developmental trajectory analyses, but the statistical properties of such nonlinear embedding methods are often not well understood. In this article, we develop the exponential-family SVD (eSVD), a nonlinear embedding method for both cells and genes jointly with respect to a random dot product model using exponential-family distributions. Our estimator uses alternating minimization, which makes our method computationally efficient, enables us to prove its identifiability conditions and consistency, and provides statistically principled procedures for tuning. All these qualities help advance the single-cell embedding literature, and we provide extensive simulations to demonstrate that the eSVD is competitive compared to other embedding methods. We apply the eSVD via Gaussian distributions whose standard deviations are proportional to the means to analyze a single-cell dataset of oligodendrocytes in mouse brains. Using the eSVD estimated embedding, we then investigate the cell developmental trajectories of the oligodendrocytes.
While previous results are not able to distinguish the trajectories among the mature oligodendrocyte cell types, our diagnostics and results demonstrate there are two major developmental trajectories that diverge at mature oligodendrocytes. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 457-470 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2021.1886106 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1886106 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:457-470 Template-Type: ReDIF-Article 1.0 Author-Name: Philip G. Sansom Author-X-Name-First: Philip G. Author-X-Name-Last: Sansom Author-Name: David B. Stephenson Author-X-Name-First: David B. Author-X-Name-Last: Stephenson Author-Name: Thomas J. Bracegirdle Author-X-Name-First: Thomas J. Author-X-Name-Last: Bracegirdle Title: On Constraining Projections of Future Climate Using Observations and Simulations From Multiple Climate Models Abstract: Numerical climate models are used to project future climate change due to both anthropogenic and natural causes. Differences between projections from different climate models are a major source of uncertainty about future climate. Emergent relationships shared by multiple climate models have the potential to constrain our uncertainty when combined with historical observations. We combine projections from 13 climate models with observational data to quantify the impact of emergent relationships on projections of future warming in the Arctic at the end of the 21st century. We propose a hierarchical Bayesian framework based on a coexchangeable representation of the relationship between climate models and the Earth system. We show how emergent constraints fit into the coexchangeable representation, and extend it to account for internal variability simulated by the models and natural variability in the Earth system. Our analysis shows that projected warming in some regions of the Arctic may be more than 2 °C lower and our uncertainty reduced by up to 30% when constrained by historical observations. A detailed theoretical comparison with existing multi-model projection frameworks is also provided. In particular, we show that projections may be biased if we do not account for internal variability in climate model predictions. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 546-557 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1851696 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1851696 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:546-557 Template-Type: ReDIF-Article 1.0 Author-Name: Justin Khim Author-X-Name-First: Justin Author-X-Name-Last: Khim Author-Name: Po-Ling Loh Author-X-Name-First: Po-Ling Author-X-Name-Last: Loh Title: Permutation Tests for Infection Graphs Abstract: We formulate and analyze a novel hypothesis testing problem for inferring the edge structure of an infection graph.
In our model, a disease spreads over a network via contagion or random infection, where the times between successive contagion events are independent exponential random variables with unknown rate parameters. A subset of nodes is also censored uniformly at random. Given the observed infection statuses of nodes in the network, the goal is to determine the underlying graph. We present a procedure based on permutation testing, and we derive sufficient conditions for the validity of our test in terms of automorphism groups of the graphs corresponding to the null and alternative hypotheses. Our test is easy to compute and does not involve estimating unknown parameters governing the process. We also derive risk bounds for our permutation test in a variety of settings, and relate our test statistic to approximate likelihood ratio testing and maximin tests. For graphs not satisfying the necessary symmetries, we provide an additional method for testing the significance of the graph structure, albeit at a higher computational cost. We conclude with an application to real data from an HIV infection network. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 770-782 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2019.1700128 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1700128 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:770-782 Template-Type: ReDIF-Article 1.0 Author-Name: Joris Chau Author-X-Name-First: Joris Author-X-Name-Last: Chau Author-Name: Rainer von Sachs Author-X-Name-First: Rainer Author-X-Name-Last: von Sachs Title: Intrinsic Wavelet Regression for Curves of Hermitian Positive Definite Matrices Abstract: Intrinsic wavelet transforms and wavelet estimation methods are introduced for curves in the non-Euclidean space of Hermitian positive definite matrices, with the application to Fourier spectral estimation of multivariate stationary time series in mind. The main focus is on intrinsic average-interpolation wavelet transforms in the space of positive definite matrices equipped with an affine-invariant Riemannian metric, and convergence rates of linear wavelet thresholding are derived for intrinsically smooth curves of Hermitian positive definite matrices. In the context of multivariate Fourier spectral estimation, intrinsic wavelet thresholding is equivariant under a change of basis of the time series, and nonlinear wavelet thresholding is able to capture localized features in the spectral density matrix across frequency, always guaranteeing positive definite estimates. The finite-sample performance of intrinsic wavelet thresholding is assessed by means of simulated data and compared to several benchmark estimators in the Riemannian manifold. Further illustrations are provided by examining the multivariate spectra of trial-replicated brain signal time series recorded during a learning experiment. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 819-832 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2019.1700129 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1700129 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:819-832 Template-Type: ReDIF-Article 1.0 Author-Name: Jasjeet S.
Sekhon Author-X-Name-First: Jasjeet S. Author-X-Name-Last: Sekhon Author-Name: Yotam Shem-Tov Author-X-Name-First: Yotam Author-X-Name-Last: Shem-Tov Title: Inference on a New Class of Sample Average Treatment Effects Abstract: We derive new variance formulas for inference on a general class of estimands of causal average treatment effects in a randomized controlled trial. We generalize the seminal work of Robins and show that when the researcher’s objective is inference on the sample average treatment effect of the treated (SATT), a consistent variance estimator exists. Although this estimand is equal to the sample average treatment effect (SATE) in expectation, potentially large differences in both accuracy and coverage can arise from the change of estimand, even asymptotically. Inference on SATE, even using a conservative confidence interval, provides incorrect coverage of SATT. We demonstrate the applicability of the new theoretical results using an empirical application with hundreds of online experiments with an average sample size of approximately 100 million observations per experiment. An R package, estCI, that implements all the proposed estimation procedures is available. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 798-804 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1730854 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1730854 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:798-804 Template-Type: ReDIF-Article 1.0 Author-Name: Jian Hu Author-X-Name-First: Jian Author-X-Name-Last: Hu Author-Name: Mingyao Li Author-X-Name-First: Mingyao Author-X-Name-Last: Li Title: Discussion of “Exponential-Family Embedding With Application to Cell Developmental Trajectories for Single-Cell RNA-Seq Data” Journal: Journal of the American Statistical Association Pages: 475-477 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2021.1880919 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1880919 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:475-477 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Correction Journal: Journal of the American Statistical Association Pages: 1039-1039 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2021.1915023 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1915023 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:1039-1039 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Introduction to Theory and Methods Special Issue on Precision Medicine and Individualized Policy Discovery Part II Journal: Journal of the American Statistical Association Pages: 645-645 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2021.1916266 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1916266 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:645-645 Template-Type: ReDIF-Article 1.0 Author-Name: Marco Avella-Medina Author-X-Name-First: Marco Author-X-Name-Last: Avella-Medina Title: Privacy-Preserving Parametric Inference: A Case for Robust Statistics Abstract: Differential privacy is a cryptographically motivated approach to privacy that has become a very active field of research over the last decade in theoretical computer science and machine learning. In this paradigm, one assumes there is a trusted curator who holds the data of individuals in a database, and the goal of privacy is to protect individual data while allowing the release of global characteristics of the database. In this setting, we introduce a general framework for parametric inference with differential privacy guarantees. We first obtain differentially private estimators based on bounded influence M-estimators by leveraging their gross-error sensitivity to calibrate a noise term added to them to ensure privacy. We then show how a similar construction can also be applied to construct differentially private test statistics analogous to the Wald, score, and likelihood ratio tests. We provide statistical guarantees for all our proposals via an asymptotic analysis. An interesting consequence of our results is to further clarify the connection between differential privacy and robust statistics. In particular, we demonstrate that differential privacy is a weaker stability requirement than infinitesimal robustness, and show that robust M-estimators can be easily randomized to guarantee both differential privacy and robustness to contaminated data. We illustrate our results both on simulated and real data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 969-983 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2019.1700130 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1700130 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:969-983 Template-Type: ReDIF-Article 1.0 Author-Name: Emily C. Hector Author-X-Name-First: Emily C. Author-X-Name-Last: Hector Author-Name: Peter X.-K. Song Author-X-Name-First: Peter X.-K. Author-X-Name-Last: Song Title: A Distributed and Integrated Method of Moments for High-Dimensional Correlated Data Analysis Abstract: This article is motivated by a regression analysis of electroencephalography (EEG) neuroimaging data with high-dimensional correlated responses exhibiting multilevel nested correlations. We develop a divide-and-conquer procedure implemented in a fully distributed and parallelized computational scheme for statistical estimation and inference of regression parameters. Despite significant efforts in the literature, the computational bottleneck associated with high-dimensional likelihoods prevents the scalability of existing methods. The proposed method addresses this challenge by dividing responses into subvectors to be analyzed separately and in parallel on a distributed platform using pairwise composite likelihood. Theoretical challenges related to combining results from dependent data are overcome in a statistically efficient way using a meta-estimator derived from Hansen’s generalized method of moments. We provide a rigorous theoretical framework for efficient estimation, inference, and goodness-of-fit tests. We develop an R package for ease of implementation.
We illustrate our method’s performance with simulations and the analysis of the EEG data, and find that iron deficiency is significantly associated with two auditory recognition memory-related potentials in the left parietal-occipital region of the brain. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 805-818 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1736082 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1736082 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:805-818 Template-Type: ReDIF-Article 1.0 Author-Name: Paolo Frumento Author-X-Name-First: Paolo Author-X-Name-Last: Frumento Author-Name: Matteo Bottai Author-X-Name-First: Matteo Author-X-Name-Last: Bottai Author-Name: Iván Fernández-Val Author-X-Name-First: Iván Author-X-Name-Last: Fernández-Val Title: Parametric Modeling of Quantile Regression Coefficient Functions With Longitudinal Data Abstract: In ordinary quantile regression, quantiles of different order are estimated one at a time. An alternative approach, which is referred to as quantile regression coefficients modeling (qrcm), is to model quantile regression coefficients as parametric functions of the order of the quantile. In this article, we describe how the qrcm paradigm can be applied to longitudinal data. We introduce a two-level quantile function, in which two different quantile regression models are used to describe the (conditional) distribution of the within-subject response and that of the individual effects. We propose a novel type of penalized fixed-effects estimator, and discuss its advantages over standard methods based on l1 and l2 penalization. We provide model identifiability conditions, derive asymptotic properties, describe goodness-of-fit measures and model selection criteria, present simulation results, and discuss an application. The proposed method has been implemented in the R package qrcm. Journal: Journal of the American Statistical Association Pages: 783-797 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2021.1892702 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1892702 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:783-797 Template-Type: ReDIF-Article 1.0 Author-Name: Dean Eckles Author-X-Name-First: Dean Author-X-Name-Last: Eckles Author-Name: Eytan Bakshy Author-X-Name-First: Eytan Author-X-Name-Last: Bakshy Title: Bias and High-Dimensional Adjustment in Observational Studies of Peer Effects Abstract: Peer effects, in which an individual’s behavior is affected by peers’ behavior, are posited by multiple theories in the social sciences. Randomized field experiments that identify peer effects, however, are often expensive or infeasible, so many studies of peer effects use observational data, which is expected to suffer from confounding. Here we show, in the context of information and media diffusion, that high-dimensional adjustment of a nonexperimental control group (660 million observations) using propensity score models produces estimates of peer effects statistically indistinguishable from those using a large randomized experiment (215 million observations).
Compared with the experiment, naive observational estimators overstate peer effects by over 300% and commonly available variables (e.g., demographics) offer little bias reduction. Adjusting for a measure of prior behaviors closely related to the focal behavior reduces this bias by 91%, while models adjusting for over 3700 past behaviors provide additional bias reduction, reducing bias by over 97%, which is statistically indistinguishable from unbiasedness. This demonstrates how detailed records of behavior can improve studies of social influence, information diffusion, and imitation; these results are encouraging for the credibility of some studies but also cautionary for studies of peer effects in rare or new behaviors. More generally, these results show how large, high-dimensional datasets and statistical learning can be used to improve causal inference. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 507-517 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1796393 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1796393 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:507-517 Template-Type: ReDIF-Article 1.0 Author-Name: Youngjun Choe Author-X-Name-First: Youngjun Author-X-Name-Last: Choe Title: Design of experiments for generalized linear models Journal: Journal of the American Statistical Association Pages: 1038-1038 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2021.1921472 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1921472 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:1038-1038 Template-Type: ReDIF-Article 1.0 Author-Name: Nathan Kallus Author-X-Name-First: Nathan Author-X-Name-Last: Kallus Title: More Efficient Policy Learning via Optimal Retargeting Abstract: Policy learning can be used to extract individualized treatment regimes from observational data in healthcare, civics, e-commerce, and beyond. One big hurdle to policy learning is a commonplace lack of overlap in the data for different actions, which can lead to unwieldy policy evaluation and poorly performing learned policies. We study a solution to this problem based on retargeting, that is, changing the population on which policies are optimized. We first argue that at the population level, retargeting may induce little to no bias. We then characterize the optimal reference policy and retargeting weights in both binary-action and multi-action settings. We do this in terms of the asymptotic efficient estimation variance of the new learning objective. We further consider weights that additionally control for potential bias due to retargeting. Extensive empirical results in a simulation study and a case study of personalized job counseling demonstrate that retargeting is a fairly easy way to significantly improve any policy learning procedure applied to observational data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 646-658 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1788948 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1788948 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:646-658 Template-Type: ReDIF-Article 1.0 Author-Name: Bowei Yan Author-X-Name-First: Bowei Author-X-Name-Last: Yan Author-Name: Purnamrita Sarkar Author-X-Name-First: Purnamrita Author-X-Name-Last: Sarkar Title: Covariate Regularized Community Detection in Sparse Graphs Abstract: In this article, we investigate community detection in networks in the presence of node covariates. In many instances, covariates and networks individually only give a partial view of the cluster structure. One needs to jointly infer the full cluster structure by considering both. In statistics, an emerging body of work has been focused on combining information from both the edges in the network and the node covariates to infer community memberships. However, so far the theoretical guarantees have been established in the dense regime, where the network can lead to perfect clustering under a broad parameter regime, and hence the role of covariates is often not clear. In this article, we examine sparse networks in conjunction with finite dimensional sub-Gaussian mixtures as covariates under moderate separation conditions. In this setting, each individual source can correctly cluster only a nonvanishing fraction of the nodes. We propose a simple optimization framework that improves clustering accuracy when the two sources carry partial information about the cluster memberships, and hence perform poorly on their own. Our optimization problem can be solved by scalable convex optimization algorithms. With a variety of simulated and real data examples, we show that the proposed method outperforms existing methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 734-745 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2019.1706541 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1706541 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:734-745 Template-Type: ReDIF-Article 1.0 Author-Name: Cong Ma Author-X-Name-First: Cong Author-X-Name-Last: Ma Author-Name: Junwei Lu Author-X-Name-First: Junwei Author-X-Name-Last: Lu Author-Name: Han Liu Author-X-Name-First: Han Author-X-Name-Last: Liu Title: Inter-Subject Analysis: A Partial Gaussian Graphical Model Approach Abstract: In contrast to traditional intra-subject analysis, the goal of inter-subject analysis (ISA) is to explore the dependency structure between different subjects, with the intra-subject dependency as a nuisance. ISA has important applications in neuroscience to study the functional connectivity between brain regions under natural stimuli. We propose a modeling framework for ISA that is based on Gaussian graphical models, under which ISA can be converted to the problem of estimation and inference of a partial Gaussian graphical model. The main statistical challenge is that we do not impose sparsity constraints on the whole precision matrix; we only assume the inter-subject part is sparse. For estimation, we propose to estimate an alternative parameter to circumvent the nonsparsity issue, and this estimate can achieve asymptotic consistency even if the intra-subject dependency is dense. For inference, we propose an “untangle and chord” procedure to de-bias our estimator. It is valid without the sparsity assumption on the inverse Hessian of the log-likelihood function.
This inferential method is general and can be applied to many other statistical problems; it is thus of independent theoretical interest. Numerical experiments on both simulated and brain imaging data validate our methods and theory. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 746-755 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1841645 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1841645 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:746-755 Template-Type: ReDIF-Article 1.0 Author-Name: Kevin Z. Lin Author-X-Name-First: Kevin Z. Author-X-Name-Last: Lin Author-Name: Jing Lei Author-X-Name-First: Jing Author-X-Name-Last: Lei Author-Name: Kathryn Roeder Author-X-Name-First: Kathryn Author-X-Name-Last: Roeder Title: Rejoinder for “Exponential-Family Embedding With Application to Cell Developmental Trajectories for Single-Cell RNA-Seq Data” Journal: Journal of the American Statistical Association Pages: 478-480 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2021.1892701 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1892701 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:478-480 Template-Type: ReDIF-Article 1.0 Author-Name: Weibin Mo Author-X-Name-First: Weibin Author-X-Name-Last: Mo Author-Name: Zhengling Qi Author-X-Name-First: Zhengling Author-X-Name-Last: Qi Author-Name: Yufeng Liu Author-X-Name-First: Yufeng Author-X-Name-Last: Liu Title: Learning Optimal Distributionally Robust Individualized Treatment Rules Abstract: Recent developments in data-driven decision science have brought great advances in individualized decision making. Given data with individual covariates, treatment assignments, and outcomes, policy makers seek the best individualized treatment rule (ITR) that maximizes the expected outcome, known as the value function. Many existing methods assume that the training and testing distributions are the same. However, the estimated optimal ITR may have poor generalizability when the training and testing distributions are not identical. In this article, we consider the problem of finding an optimal ITR from a restricted ITR class where there are some unknown covariate changes between the training and testing distributions. We propose a novel distributionally robust ITR (DR-ITR) framework that maximizes the worst-case value function over a set of underlying distributions that are “close” to the training distribution. The resulting DR-ITR guarantees reasonably good performance across all such distributions. We further propose a calibrating procedure that tunes the DR-ITR adaptively to a small amount of calibration data from a target population. Our numerical studies show that the calibrated DR-ITR enjoys better generalizability than the standard ITR. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 659-674 Issue: 534 Volume: 116 Year: 2021 Month: 4 X-DOI: 10.1080/01621459.2020.1796359 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1796359 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:659-674 Template-Type: ReDIF-Article 1.0 Author-Name: Min-ge Xie Author-X-Name-First: Min-ge Author-X-Name-Last: Xie Author-Name: Zheshi Zheng Author-X-Name-First: Zheshi Author-X-Name-Last: Zheng Title: Discussion of Professor Bradley Efron’s Article on “Prediction, Estimation, and Attribution” Journal: Journal of the American Statistical Association Pages: 667-671 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2020.1762614 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1762614 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:667-671 Template-Type: ReDIF-Article 1.0 Author-Name: Antony M. Overstall Author-X-Name-First: Antony M. Author-X-Name-Last: Overstall Author-Name: David C. Woods Author-X-Name-First: David C. Author-X-Name-Last: Woods Author-Name: Ben M. Parker Author-X-Name-First: Ben M. Author-X-Name-Last: Parker Title: Bayesian Optimal Design for Ordinary Differential Equation Models With Application in Biological Science Abstract: Bayesian optimal design is considered for experiments where the response distribution depends on the solution to a system of nonlinear ordinary differential equations. The motivation is an experiment to estimate parameters in the equations governing the transport of amino acids through cell membranes in human placentas. Decision-theoretic Bayesian design of experiments for such nonlinear models is conceptually very attractive, allowing the formal incorporation of prior knowledge to overcome the parameter dependence of frequentist design and being less reliant on asymptotic approximations. However, the necessary approximation and maximization of the, typically analytically intractable, expected utility results in a computationally challenging problem. These issues are further exacerbated if the solution to the differential equations is not available in closed-form. This article proposes a new combination of a probabilistic solution to the equations embedded within a Monte Carlo approximation to the expected utility with cyclic descent of a smooth approximation to find the optimal design. A novel precomputation algorithm reduces the computational burden, making the search for an optimal design feasible for bigger problems. The methods are demonstrated by finding new designs for a number of common models derived from differential equations, and by providing optimal designs for the placenta experiment. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 583-598 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1617154 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1617154 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:583-598 Template-Type: ReDIF-Article 1.0 Author-Name: Pierre E. Jacob Author-X-Name-First: Pierre E. Author-X-Name-Last: Jacob Author-Name: Fredrik Lindsten Author-X-Name-First: Fredrik Author-X-Name-Last: Lindsten Author-Name: Thomas B. Schön Author-X-Name-First: Thomas B. 
Author-X-Name-Last: Schön Title: Smoothing With Couplings of Conditional Particle Filters Abstract: In state–space models, smoothing refers to the task of estimating a latent stochastic process given noisy measurements related to the process. We propose an unbiased estimator of smoothing expectations. The lack-of-bias property has methodological benefits: independent estimators can be generated in parallel, and confidence intervals can be constructed from the central limit theorem to quantify the approximation error. To design unbiased estimators, we combine a generic debiasing technique for Markov chains with a Markov chain Monte Carlo algorithm for smoothing. The resulting procedure is widely applicable, and we show in numerical experiments that the removal of the bias comes at a manageable increase in variance. We establish the validity of the proposed estimators under mild assumptions. Numerical experiments are provided on toy models, including a setting of highly informative observations, and for a realistic Lotka–Volterra model with an intractable transition density. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 721-729 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2018.1548856 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1548856 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:721-729 Template-Type: ReDIF-Article 1.0 Author-Name: Xinyu Zhang Author-X-Name-First: Xinyu Author-X-Name-Last: Zhang Author-Name: Guohua Zou Author-X-Name-First: Guohua Author-X-Name-Last: Zou Author-Name: Hua Liang Author-X-Name-First: Hua Author-X-Name-Last: Liang Author-Name: Raymond J. Carroll Author-X-Name-First: Raymond J. Author-X-Name-Last: Carroll Title: Parsimonious Model Averaging With a Diverging Number of Parameters Abstract: Model averaging generally provides better predictions than model selection, but the existing model averaging methods cannot lead to parsimonious models. Parsimony is an especially important property when the number of parameters is large. To achieve a parsimonious model averaging coefficient estimator, we suggest a novel criterion for choosing weights. Asymptotic properties are derived in two practical scenarios: (i) one or more correct models exist in the candidate model set and (ii) all candidate models are misspecified. Under the former scenario, it is proved that our method can assign weight one to the smallest correct model and the resulting model averaging estimators of coefficients have many zeros and thus lead to a parsimonious model. The asymptotic distribution of the estimators is also provided. Under the latter scenario, we focus mainly on prediction and prove that the proposed procedure is asymptotically optimal in the sense that its squared prediction loss and risk are asymptotically identical to those of the best—but infeasible—model averaging estimator. Numerical analysis shows the promise of the proposed procedure over existing model averaging and selection methods. Journal: Journal of the American Statistical Association Pages: 972-984 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1604363 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604363 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:972-984 Template-Type: ReDIF-Article 1.0 Author-Name: Yichuan Zhao Author-X-Name-First: Yichuan Author-X-Name-Last: Zhao Title: Empirical Likelihood Methods in Biomedicine and Health Journal: Journal of the American Statistical Association Pages: 1028-1029 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2020.1759986 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1759986 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:1028-1029 Template-Type: ReDIF-Article 1.0 Author-Name: Frederic P. Schoenberg Author-X-Name-First: Frederic P. Author-X-Name-Last: Schoenberg Title: Theory of Spatial Statistics: A Concise Introduction Journal: Journal of the American Statistical Association Pages: 1033-1034 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2020.1759991 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1759991 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:1033-1034 Template-Type: ReDIF-Article 1.0 Author-Name: Jinhan Xie Author-X-Name-First: Jinhan Author-X-Name-Last: Xie Author-Name: Yuanyuan Lin Author-X-Name-First: Yuanyuan Author-X-Name-Last: Lin Author-Name: Xiaodong Yan Author-X-Name-First: Xiaodong Author-X-Name-Last: Yan Author-Name: Niansheng Tang Author-X-Name-First: Niansheng Author-X-Name-Last: Tang Title: Category-Adaptive Variable Screening for Ultra-High Dimensional Heterogeneous Categorical Data Abstract: The populations of interest in modern studies are very often heterogeneous. The population heterogeneity, the qualitative nature of the outcome variable, and the high dimensionality of the predictors pose significant challenges in statistical analysis. In this article, we introduce a category-adaptive screening procedure for high-dimensional heterogeneous data, designed to detect category-specific important covariates. The proposal is a model-free approach, requiring no specification of a regression model, and an adaptive procedure in the sense that the set of active variables is allowed to vary across different categories, thus making it more flexible in accommodating heterogeneity. For response-selective sampling data, another main discovery of this article is that the proposed method works directly without any modification. Under mild regularity conditions, the new procedure is shown to possess the sure screening and ranking consistency properties. Simulation studies provide supportive evidence that the proposed method performs well under various settings and is effective at extracting category-specific information. Applications are illustrated with two real datasets. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 747-760 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1573734 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1573734 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:747-760 Template-Type: ReDIF-Article 1.0 Author-Name: Bradley Efron Author-X-Name-First: Bradley Author-X-Name-Last: Efron Title: Prediction, Estimation, and Attribution Abstract: The scientific needs and computational limitations of the twentieth century fashioned classical statistical methodology.
Both the needs and limitations have changed in the twenty-first, and so has the methodology. Large-scale prediction algorithms—neural nets, deep learning, boosting, support vector machines, random forests—have achieved star status in the popular press. They are recognizable as heirs to the regression tradition, but ones carried out at enormous scale and on titanic datasets. How do these algorithms compare with standard regression techniques such as ordinary least squares or logistic regression? Several key discrepancies will be examined, centering on the differences between prediction and estimation or prediction and attribution (significance testing). Most of the discussion is carried out through small numerical examples. Journal: Journal of the American Statistical Association Pages: 636-655 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2020.1762613 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1762613 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:636-655 Template-Type: ReDIF-Article 1.0 Author-Name: Robin Henderson Author-X-Name-First: Robin Author-X-Name-Last: Henderson Author-Name: Irina Makarenko Author-X-Name-First: Irina Author-X-Name-Last: Makarenko Author-Name: Paul Bushby Author-X-Name-First: Paul Author-X-Name-Last: Bushby Author-Name: Andrew Fletcher Author-X-Name-First: Andrew Author-X-Name-Last: Fletcher Author-Name: Anvar Shukurov Author-X-Name-First: Anvar Author-X-Name-Last: Shukurov Title: Statistical Topology and the Random Interstellar Medium Abstract: We use topological methods to investigate the small-scale variation and local spatial characteristics of the interstellar medium (ISM) in three regions of the southern sky. We demonstrate that there are circumstances where topological methods can identify differences in distributions when conventional marginal or correlation analyses may not. We propose a nonparametric method for comparing two fields based on the counts of topological features and the geometry of the associated persistence diagrams. We investigate the expected distribution of topological structures quantified through Betti numbers under Gaussian random field (GRF) assumptions, which underlie many astrophysical models of the ISM. When we apply the methods to the astrophysical data, we find strong evidence that one of the three regions is both topologically dissimilar to the other two and not consistent with an underlying GRF model. This region is proximal to a region of recent star formation whereas the others are more distant. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 625-635 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1647841 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1647841 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:625-635 Template-Type: ReDIF-Article 1.0 Author-Name: Jerome Friedman Author-X-Name-First: Jerome Author-X-Name-Last: Friedman Author-Name: Trevor Hastie Author-X-Name-First: Trevor Author-X-Name-Last: Hastie Author-Name: Robert Tibshirani Author-X-Name-First: Robert Author-X-Name-Last: Tibshirani Title: Discussion of “Prediction, Estimation, and Attribution” by Bradley Efron Abstract: Professor Efron has presented us with a thought-provoking paper on the relationship between prediction, estimation, and attribution in the modern era of data science. While we appreciate many of his arguments, we see more of a continuum between the old and new methodology, and the opportunity for both to improve through their synergy. Journal: Journal of the American Statistical Association Pages: 665-666 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2020.1762617 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1762617 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:665-666 Template-Type: ReDIF-Article 1.0 Author-Name: Briana J. K. Stephenson Author-X-Name-First: Briana J. K. Author-X-Name-Last: Stephenson Author-Name: Amy H. Herring Author-X-Name-First: Amy H. Author-X-Name-Last: Herring Author-Name: Andrew Olshan Author-X-Name-First: Andrew Author-X-Name-Last: Olshan Title: Robust Clustering With Subpopulation-Specific Deviations Abstract: The National Birth Defects Prevention Study (NBDPS) is a case-control study of birth defects conducted across 10 U.S. states. Researchers are interested in characterizing the etiologic role of maternal diet, collected using a food frequency questionnaire. Because diet is multidimensional, dimension reduction methods such as cluster analysis are often used to summarize dietary patterns. In a large, heterogeneous population, traditional clustering methods, such as latent class analysis, used to estimate dietary patterns can produce a large number of clusters due to a variety of factors, including study size and regional diversity. These factors result in a loss of interpretability of patterns that may differ due to minor consumption changes. Based on an adaptation of the local partition process, we propose a new method, robust profile clustering, to handle these data complexities. Here, participants may be clustered at two levels: (1) globally, where women are assigned to an overall population-level cluster via an overfitted finite mixture model, and (2) locally, where regional variations in diet are accommodated via a beta-Bernoulli process dependent on subpopulation differences. We use our method to analyze the NBDPS data, deriving prepregnancy dietary patterns for women in the NBDPS while accounting for regional variability. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 521-537 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1611583 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1611583 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:521-537 Template-Type: ReDIF-Article 1.0 Author-Name: Giacomo Zanella Author-X-Name-First: Giacomo Author-X-Name-Last: Zanella Title: Informed Proposals for Local MCMC in Discrete Spaces Abstract: There is a lack of methodological results to design efficient Markov chain Monte Carlo (MCMC) algorithms for statistical models with discrete-valued high-dimensional parameters. Motivated by this consideration, we propose a simple framework for the design of informed MCMC proposals (i.e., Metropolis–Hastings proposal distributions that appropriately incorporate local information about the target) which is naturally applicable to discrete spaces. Using Peskun-type comparisons of Markov kernels, we explicitly characterize the class of asymptotically optimal proposal distributions under this framework, which we refer to as locally balanced proposals. The resulting algorithms are straightforward to implement in discrete spaces and provide orders of magnitude improvements in efficiency compared to alternative MCMC schemes, including discrete versions of Hamiltonian Monte Carlo. Simulations are performed with both simulated and real datasets, including a detailed application to Bayesian record linkage. A direct connection with gradient-based MCMC suggests that locally balanced proposals can be seen as a natural way to extend the latter to discrete spaces. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 852-865 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1585255 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1585255 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:852-865 Template-Type: ReDIF-Article 1.0 Author-Name: Fei Jiang Author-X-Name-First: Fei Author-X-Name-Last: Jiang Author-Name: Qing Cheng Author-X-Name-First: Qing Author-X-Name-Last: Cheng Author-Name: Guosheng Yin Author-X-Name-First: Guosheng Author-X-Name-Last: Yin Author-Name: Haipeng Shen Author-X-Name-First: Haipeng Author-X-Name-Last: Shen Title: Functional Censored Quantile Regression Abstract: We propose a functional censored quantile regression model to describe the time-varying relationship between time-to-event outcomes and corresponding functional covariates. The time-varying effect is modeled as an unspecified function that is approximated via B-splines. A generalized approximate cross-validation method is developed to select the number of knots by minimizing the expected loss. We establish asymptotic properties of the method and the knot selection procedure. Furthermore, we conduct extensive simulation studies to evaluate the finite sample performance of our method. Finally, we analyze the functional relationship between ambulatory blood pressure trajectories and clinical outcome in stroke patients. The results reinforce the importance of the morning blood pressure surge phenomenon, whose effect has caught attention but remains controversial in the medical literature. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 931-944 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1602047 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1602047 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:931-944 Template-Type: ReDIF-Article 1.0 Author-Name: Jin-Ting Zhang Author-X-Name-First: Jin-Ting Author-X-Name-Last: Zhang Author-Name: Jia Guo Author-X-Name-First: Jia Author-X-Name-Last: Guo Author-Name: Bu Zhou Author-X-Name-First: Bu Author-X-Name-Last: Zhou Author-Name: Ming-Yen Cheng Author-X-Name-First: Ming-Yen Author-X-Name-Last: Cheng Title: A Simple Two-Sample Test in High Dimensions Based on L2-Norm Abstract: Testing the equality of two means is a fundamental inference problem. For high-dimensional data, Hotelling’s T2 test either performs poorly or becomes inapplicable. Several modifications have been proposed to address this issue. However, most of them are based on asymptotic normality of the null distributions of their test statistics, which inevitably requires strong assumptions on the covariance. We study this problem thoroughly and propose an L2-norm based test that works under mild conditions and even when there are fewer observations than the dimension. Specifically, to cope with general nonnormality of the null distribution, we employ the Welch–Satterthwaite χ2-approximation. We derive a sharp upper bound on the approximation error and use it to justify that the χ2-approximation is preferred to the normal approximation. Simple ratio-consistent estimators for the parameters in the χ2-approximation are given. Importantly, our test can cope with singularity or near singularity of the covariance, which is commonly seen in high dimensions and is the main cause of nonnormality. The power of the proposed test is also investigated. Extensive simulation studies and an application show that our test is at least comparable to and often outperforms several competitors in terms of size control, and the powers are comparable when their sizes are. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1011-1027 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1604366 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604366 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:1011-1027 Template-Type: ReDIF-Article 1.0 Author-Name: Chih-Li Sung Author-X-Name-First: Chih-Li Author-X-Name-Last: Sung Author-Name: Ying Hung Author-X-Name-First: Ying Author-X-Name-Last: Hung Author-Name: William Rittase Author-X-Name-First: William Author-X-Name-Last: Rittase Author-Name: Cheng Zhu Author-X-Name-First: Cheng Author-X-Name-Last: Zhu Author-Name: C. F. Jeff Wu Author-X-Name-First: C. F. Author-X-Name-Last: Jeff Wu Title: A Generalized Gaussian Process Model for Computer Experiments With Binary Time Series Abstract: Non-Gaussian observations such as binary responses are common in some computer experiments. Motivated by the analysis of a class of cell adhesion experiments, we introduce a generalized Gaussian process model for binary responses, which shares some common features with standard GP models. In addition, the proposed model incorporates a flexible mean function that can capture different types of time series structures. Asymptotic properties of the estimators are derived, and an optimal predictor as well as its predictive distribution are constructed. Their performance is examined via two simulation studies. The methodology is applied to study computer simulations for cell adhesion experiments.
The fitted model reveals important biological information in repeated cell bindings, which is not directly observable in lab experiments. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 945-956 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1604361 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604361 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:945-956 Template-Type: ReDIF-Article 1.0 Author-Name: Zhiliang Ying Author-X-Name-First: Zhiliang Author-X-Name-Last: Ying Author-Name: Wen Yu Author-X-Name-First: Wen Author-X-Name-Last: Yu Author-Name: Ziqiang Zhao Author-X-Name-First: Ziqiang Author-X-Name-Last: Zhao Author-Name: Ming Zheng Author-X-Name-First: Ming Author-X-Name-Last: Zheng Title: Regression Analysis of Doubly Truncated Data Abstract: Doubly truncated data are found in the astronomy, econometrics, and survival analysis literature. They arise when each observation is confined to an interval, that is, only those observations which fall within their respective intervals are observed, along with the intervals. Unlike one-sided truncation, which can be handled by a counting-process-based approach, doubly truncated data are much more difficult to handle. In their analysis of an astronomical dataset, Efron and Petrosian proposed some nonparametric methods for doubly truncated data. Motivated by their approach, as well as by the work of Bhattacharya et al. for right truncated data, we propose a general method for estimating the regression parameter when the dependent variable is subject to double truncation. It extends the Mann–Whitney-type rank estimator and can be computed easily by existing software packages. Weighted rank estimation is also considered for improving estimation efficiency. We show that the resulting estimators are consistent and asymptotically normal. Resampling schemes are proposed with large sample justification for approximating the limiting distributions. The quasar data in Efron and Petrosian and an AIDS incubation dataset are analyzed by the new method. Simulation results show that the proposed method works well. Journal: Journal of the American Statistical Association Pages: 810-821 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1585252 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1585252 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:810-821 Template-Type: ReDIF-Article 1.0 Author-Name: A. C. Davison Author-X-Name-First: A. C. Author-X-Name-Last: Davison Title: Discussion Journal: Journal of the American Statistical Association Pages: 663-664 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2020.1762616 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1762616 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:663-664 Template-Type: ReDIF-Article 1.0 Author-Name: Yen-Chi Chen Author-X-Name-First: Yen-Chi Author-X-Name-Last: Chen Title: Statistical Modelling by Exponential Families Journal: Journal of the American Statistical Association Pages: 1032-1032 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2020.1759989 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1759989 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:1032-1032 Template-Type: ReDIF-Article 1.0 Author-Name: Bin Yu Author-X-Name-First: Bin Author-X-Name-Last: Yu Author-Name: Rebecca Barter Author-X-Name-First: Rebecca Author-X-Name-Last: Barter Title: The Data Science Process: One Culture Journal: Journal of the American Statistical Association Pages: 672-674 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2020.1762615 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1762615 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:672-674 Template-Type: ReDIF-Article 1.0 Author-Name: Jialiang Mao Author-X-Name-First: Jialiang Author-X-Name-Last: Mao Author-Name: Yuhan Chen Author-X-Name-First: Yuhan Author-X-Name-Last: Chen Author-Name: Li Ma Author-X-Name-First: Li Author-X-Name-Last: Ma Title: Bayesian Graphical Compositional Regression for Microbiome Data Abstract: An important task in microbiome studies is to test for the existence of, and to characterize, differences in the microbiome composition across groups of samples. Important challenges of this problem include the large within-group heterogeneities among samples and the existence of potential confounding variables that, when ignored, increase the chance of false discoveries and reduce the power for identifying true differences. We propose a probabilistic framework to overcome these issues by combining three ideas: (i) a phylogenetic tree-based decomposition of the cross-group comparison problem into a series of local tests, (ii) a graphical model that links the local tests to allow information sharing across taxa, and (iii) a Bayesian testing strategy that incorporates covariates and integrates out the within-group variation, avoiding potentially unstable point estimates. With the proposed method, we analyze the American Gut data to compare the gut microbiome composition of groups of participants with different dietary habits. Our analysis shows that (i) the frequency of consuming fruit, seafood, vegetables, and whole grains is closely related to the gut microbiome composition and (ii) the conclusion of the analysis can change drastically when different sets of relevant covariates are adjusted for, indicating the necessity of carefully selecting and including possible confounders in the analysis when comparing microbiome compositions with data from observational studies. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 610-624 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1647212 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1647212 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:610-624 Template-Type: ReDIF-Article 1.0 Author-Name: Kyunghee Han Author-X-Name-First: Kyunghee Author-X-Name-Last: Han Author-Name: Hans-Georg Müller Author-X-Name-First: Hans-Georg Author-X-Name-Last: Müller Author-Name: Byeong U. Park Author-X-Name-First: Byeong U. Author-X-Name-Last: Park Title: Additive Functional Regression for Densities as Responses Abstract: We propose and investigate additive density regression, a novel additive functional regression model for situations where the responses are random distributions that can be viewed as random densities and the predictors are vectors. Data in the form of samples of densities or distributions are increasingly encountered in statistical analysis and there is a need for flexible regression models that accommodate random densities as responses. Such models are of special interest for multivariate continuous predictors, where unrestricted nonparametric regression approaches are subject to the curse of dimensionality. Additive models can be expected to maintain one-dimensional rates of convergence while permitting a substantial degree of flexibility. This motivates the development of additive regression models for situations where multivariate continuous predictors are coupled with densities as responses. To overcome the problem that distributions do not form a vector space, we utilize a class of transformations that map densities to unrestricted square integrable functions and then deploy an additive functional regression model to fit the responses in the unrestricted space, finally transforming back to density space. We implement the proposed additive model with an extended version of smooth backfitting and establish the consistency of this approach, including rates of convergence. The proposed method is illustrated with an application to the distributions of baby names in the United States. Journal: Journal of the American Statistical Association Pages: 997-1010 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1604365 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604365 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:997-1010 Template-Type: ReDIF-Article 1.0 Author-Name: Amanda F. Mejia Author-X-Name-First: Amanda F. Author-X-Name-Last: Mejia Author-Name: Yu (Ryan) Yue Author-X-Name-First: Yu (Ryan) Author-X-Name-Last: Yue Author-Name: David Bolin Author-X-Name-First: David Author-X-Name-Last: Bolin Author-Name: Finn Lindgren Author-X-Name-First: Finn Author-X-Name-Last: Lindgren Author-Name: Martin A. Lindquist Author-X-Name-First: Martin A. Author-X-Name-Last: Lindquist Title: A Bayesian General Linear Modeling Approach to Cortical Surface fMRI Data Analysis Abstract: Cortical surface functional magnetic resonance imaging (cs-fMRI) has recently grown in popularity versus traditional volumetric fMRI. In addition to offering better whole-brain visualization, dimension reduction, removal of extraneous tissue types, and improved alignment of cortical areas across subjects, it is also more compatible with common assumptions of Bayesian spatial models. However, as no spatial Bayesian model has been proposed for cs-fMRI data, most analyses continue to employ the classical general linear model (GLM), a “massive univariate” approach. Here, we propose a spatial Bayesian GLM for cs-fMRI, which employs a class of sophisticated spatial processes to model latent activation fields. 
We make several advances compared with existing spatial Bayesian models for volumetric fMRI. First, we use integrated nested Laplace approximations, a highly accurate and efficient Bayesian computation technique, rather than variational Bayes. To identify regions of activation, we utilize an excursion set method based on the joint posterior distribution of the latent fields, rather than the marginal distribution at each location. Finally, we propose the first multi-subject spatial Bayesian modeling approach, which addresses a major gap in the existing literature. The methods are very computationally advantageous and are validated through simulation studies and two task fMRI studies from the Human Connectome Project. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 501-520 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1611582 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1611582 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:501-520 Template-Type: ReDIF-Article 1.0 Author-Name: Shan Yu Author-X-Name-First: Shan Author-X-Name-Last: Yu Author-Name: Guannan Wang Author-X-Name-First: Guannan Author-X-Name-Last: Wang Author-Name: Li Wang Author-X-Name-First: Li Author-X-Name-Last: Wang Author-Name: Chenhui Liu Author-X-Name-First: Chenhui Author-X-Name-Last: Liu Author-Name: Lijian Yang Author-X-Name-First: Lijian Author-X-Name-Last: Yang Title: Estimation and Inference for Generalized Geoadditive Models Abstract: In many application areas, data are collected on a count or binary response with spatial covariate information. In this article, we introduce a new class of generalized geoadditive models (GGAMs) for spatial data distributed over complex domains. Through a link function, the proposed GGAM assumes that the mean of the discrete response variable depends on additive univariate functions of explanatory variables and a bivariate function to adjust for the spatial effect. We propose a two-stage approach for estimating and making inferences about the components in the GGAM. In the first stage, the univariate components and the geographical component in the model are approximated via univariate polynomial splines and bivariate penalized splines over triangulation, respectively. In the second stage, local polynomial smoothing is applied to the cleaned univariate data to average out the variation of the first-stage estimators. We investigate the consistency of the proposed estimators and the asymptotic normality of the univariate components. We also establish the simultaneous confidence band for each of the univariate components. The performance of the proposed method is evaluated by two simulation studies. We apply the proposed method to analyze the crash counts data in the Tampa-St. Petersburg urbanized area in Florida. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 761-774 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1574584 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1574584 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:761-774 Template-Type: ReDIF-Article 1.0 Author-Name: Zhengling Qi Author-X-Name-First: Zhengling Author-X-Name-Last: Qi Author-Name: Dacheng Liu Author-X-Name-First: Dacheng Author-X-Name-Last: Liu Author-Name: Haoda Fu Author-X-Name-First: Haoda Author-X-Name-Last: Fu Author-Name: Yufeng Liu Author-X-Name-First: Yufeng Author-X-Name-Last: Liu Title: Multi-Armed Angle-Based Direct Learning for Estimating Optimal Individualized Treatment Rules With Various Outcomes Abstract: Estimating an optimal individualized treatment rule (ITR) based on patients’ information is an important problem in precision medicine. An optimal ITR is a decision function that optimizes patients’ expected clinical outcomes. Many existing methods in the literature are designed for binary treatment settings with a continuous outcome of interest. Much less work has been done on estimating optimal ITRs with good interpretability in multiple-treatment settings. In this article, we propose angle-based direct learning (AD-learning) to efficiently estimate optimal ITRs with multiple treatments. Our proposed method can be applied to various types of outcomes, such as continuous, survival, or binary outcomes. Moreover, it has an interesting geometric interpretation of the effect of different treatments for each individual patient, which can help doctors and patients make better decisions. Finite sample error bounds have been established to provide a theoretical guarantee for AD-learning. Finally, we demonstrate the superior performance of our method via an extensive simulation study and real data applications. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 678-691 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2018.1529597 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1529597 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:678-691 Template-Type: ReDIF-Article 1.0 Author-Name: Bradley Efron Author-X-Name-First: Bradley Author-X-Name-Last: Efron Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 675-677 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2020.1762453 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1762453 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:675-677 Template-Type: ReDIF-Article 1.0 Author-Name: Wenjia Wang Author-X-Name-First: Wenjia Author-X-Name-Last: Wang Author-Name: Rui Tuo Author-X-Name-First: Rui Author-X-Name-Last: Tuo Author-Name: C. F. Jeff Wu Author-X-Name-First: C. F. Author-X-Name-Last: Jeff Wu Title: On Prediction Properties of Kriging: Uniform Error Bounds and Robustness Abstract: Kriging based on Gaussian random fields is widely used in reconstructing unknown functions. The kriging method has pointwise predictive distributions which are computationally simple. However, in many applications one would like to predict for a range of untried points simultaneously. In this work, we obtain some error bounds for the simple and universal kriging predictor under the uniform metric. It works for a scattered set of input points in an arbitrary dimension, and also covers the case where the covariance function of the Gaussian process is misspecified.
These results lead to a better understanding of the rate of convergence of kriging under the Gaussian or the Matérn correlation functions, the relationship between space-filling designs and kriging models, and the robustness of the Matérn correlation functions. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 920-930 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1598868 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1598868 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:920-930 Template-Type: ReDIF-Article 1.0 Author-Name: Lu Yang Author-X-Name-First: Lu Author-X-Name-Last: Yang Author-Name: Edward W. Frees Author-X-Name-First: Edward W. Author-X-Name-Last: Frees Author-Name: Zhengjun Zhang Author-X-Name-First: Zhengjun Author-X-Name-Last: Zhang Title: Nonparametric Estimation of Copula Regression Models With Discrete Outcomes Abstract: Multivariate discrete outcomes are common in a wide range of areas including insurance, finance, and biology. When the interplay between outcomes is significant, quantifying dependencies among interrelated variables is of great importance. Due to their ability to accommodate dependence flexibly, copulas are being applied increasingly. Yet, the application of copulas to discrete data is still in its infancy; one of the biggest barriers is the nonuniqueness of copulas, calling into question model interpretations and predictions. In this article, we study copula estimation with discrete outcomes in a regression context. As the marginal distributions vary with covariates, inclusion of continuous regressors expands the region of support for consistent estimation of copulas. Because some properties of continuous outcomes do not carry over to discrete outcomes, specification of a copula model has been a problem. We propose a nonparametric estimator of copulas to identify the “hidden” dependence structure for discrete outcomes and develop its asymptotic properties. The proposed nonparametric estimator can also serve as a diagnostic tool for selecting a parametric form for copulas. In the simulation study, we explore the performance of the proposed estimator under different scenarios and provide guidance on when the choice of copulas is important. The performance of the estimator improves as discreteness diminishes. A practical bandwidth selector is also proposed. An empirical analysis examines a dataset from the Local Government Property Insurance Fund (LGPIF) in the state of Wisconsin. We apply the nonparametric estimator to model the dependence among claim frequencies from different types of insurance coverage. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 707-720 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2018.1546586 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1546586 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:707-720 Template-Type: ReDIF-Article 1.0 Author-Name: Han Li Author-X-Name-First: Han Author-X-Name-Last: Li Author-Name: Minxuan Xu Author-X-Name-First: Minxuan Author-X-Name-Last: Xu Author-Name: Jun S. Liu Author-X-Name-First: Jun S.
Author-X-Name-Last: Liu Author-Name: Xiaodan Fan Author-X-Name-First: Xiaodan Author-X-Name-Last: Fan Title: An Extended Mallows Model for Ranked Data Aggregation Abstract: In this article, we study the rank aggregation problem, which aims to find a consensus ranking by aggregating multiple ranking lists. To address the problem probabilistically, we formulate an elaborate ranking model for full and partial rankings by generalizing the Mallows model. Our model assumes that the ranked data are generated through a multistage ranking process that is explicitly governed by parameters that measure the overall quality and stability of the process. The new model is quite flexible and has a closed-form expression. Under mild conditions, we can derive a few useful theoretical properties of the model. Furthermore, we propose an efficient statistic called the rank coefficient to detect over-correlated rankings and a hierarchical ranking model to fit the data. Through extensive simulation studies and real applications, we evaluate the merits of our models and demonstrate that they outperform the state-of-the-art methods in diverse scenarios. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 730-746 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1573733 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1573733 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:730-746 Template-Type: ReDIF-Article 1.0 Author-Name: Xiwei Tang Author-X-Name-First: Xiwei Author-X-Name-Last: Tang Author-Name: Xuan Bi Author-X-Name-First: Xuan Author-X-Name-Last: Bi Author-Name: Annie Qu Author-X-Name-First: Annie Author-X-Name-Last: Qu Title: Individualized Multilayer Tensor Learning With an Application in Imaging Analysis Abstract: This work is motivated by multimodality breast cancer imaging data, which are quite challenging in that the signals of discrete tumor-associated microvesicles are randomly distributed with heterogeneous patterns. This imposes a significant challenge for conventional imaging regression and dimension reduction models that assume a homogeneous feature structure. We develop an innovative multilayer tensor learning method to incorporate heterogeneity into a higher-order tensor decomposition and predict disease status effectively through utilizing subject-wise imaging features and multimodality information. Specifically, we construct a multilayer decomposition which leverages an individualized imaging layer in addition to a modality-specific tensor structure. One major advantage of our approach is that we are able to efficiently capture the heterogeneous spatial features of signals that are not characterized by a population structure, while integrating multimodality information simultaneously. To achieve scalable computing, we develop a new bi-level block improvement algorithm. In theory, we investigate the convergence property of the algorithm, the tensor signal recovery error bound, and the asymptotic consistency of the prediction model estimation. We also apply the proposed method to simulated and human breast cancer imaging data. Numerical results demonstrate that the proposed method outperforms existing competing methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 836-851 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1585254 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1585254 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:836-851 Template-Type: ReDIF-Article 1.0 Author-Name: Degui Li Author-X-Name-First: Degui Author-X-Name-Last: Li Author-Name: Peter M. Robinson Author-X-Name-First: Peter M. Author-X-Name-Last: Robinson Author-Name: Han Lin Shang Author-X-Name-First: Han Lin Author-X-Name-Last: Shang Title: Long-Range Dependent Curve Time Series Abstract: We introduce methods and theory for functional or curve time series with long-range dependence. The temporal sum of the curve process is shown to be asymptotically normally distributed, the conditions for this covering a functional version of fractionally integrated autoregressive moving averages. We also construct an estimate of the long-run covariance function, which we use, via functional principal component analysis, in estimating the orthonormal functions spanning the dominant subspace of the curves. In a semiparametric context, we propose an estimate of the memory parameter and establish its consistency. A Monte Carlo study of finite-sample performance is included, along with two empirical applications. The first of these finds a degree of stability and persistence in intraday stock returns. The second finds similarity in the extent of long memory in incremental age-specific fertility rates across some developed nations. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 957-971 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1604362 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604362 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:957-971 Template-Type: ReDIF-Article 1.0 Author-Name: Trambak Banerjee Author-X-Name-First: Trambak Author-X-Name-Last: Banerjee Author-Name: Gourab Mukherjee Author-X-Name-First: Gourab Author-X-Name-Last: Mukherjee Author-Name: Shantanu Dutta Author-X-Name-First: Shantanu Author-X-Name-Last: Dutta Author-Name: Pulak Ghosh Author-X-Name-First: Pulak Author-X-Name-Last: Ghosh Title: A Large-Scale Constrained Joint Modeling Approach for Predicting User Activity, Engagement, and Churn With Application to Freemium Mobile Games Abstract: We develop a constrained extremely zero inflated joint (CEZIJ) modeling framework for simultaneously analyzing player activity, engagement, and dropouts (churns) in app-based mobile freemium games. Our proposed framework addresses the complex interdependencies between a player’s decision to use a freemium product, the extent of her direct and indirect engagement with the product and her decision to permanently drop its usage. CEZIJ extends the existing class of joint models for longitudinal and survival data in several ways. It not only accommodates extremely zero-inflated responses in a joint model setting but also incorporates domain-specific, convex structural constraints on the model parameters. Longitudinal data from app-based mobile games usually exhibit a large set of potential predictors and choosing the relevant set of predictors is highly desirable for various purposes including improved predictability. 
To achieve this goal, CEZIJ conducts simultaneous, coordinated selection of fixed and random effects in high-dimensional penalized generalized linear mixed models. For analyzing such large-scale datasets, variable selection and estimation are conducted via a distributed-computing-based split-and-conquer approach that massively increases scalability and provides better predictive performance than competing methods. Our results reveal codependencies between varied player characteristics that promote player activity and engagement. Furthermore, the predicted churn probabilities exhibit idiosyncratic clusters of player profiles over time, based on which marketers and game managers can segment the playing population for improved monetization of app-based freemium games. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 538-554 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1611584 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1611584 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:538-554 Template-Type: ReDIF-Article 1.0 Author-Name: Ian Laga Author-X-Name-First: Ian Author-X-Name-Last: Laga Author-Name: Xiaoyue Niu Author-X-Name-First: Xiaoyue Author-X-Name-Last: Niu Title: Model-Based Geostatistics for Global Public Health: Methods and Applications Journal: Journal of the American Statistical Association Pages: 1030-1032 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2020.1759988 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1759988 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:1030-1032 Template-Type: ReDIF-Article 1.0 Author-Name: Giampiero Marra Author-X-Name-First: Giampiero Author-X-Name-Last: Marra Author-Name: Rosalba Radice Author-X-Name-First: Rosalba Author-X-Name-Last: Radice Title: Copula Link-Based Additive Models for Right-Censored Event Time Data Abstract: This article proposes an approach to estimate and make inference on the parameters of copula link-based survival models. The methodology allows for the margins to be specified using flexible parametric formulations for time-to-event data, the baseline survival functions to be modeled using monotonic splines, and each parameter of the assumed joint survival distribution to depend on an additive predictor incorporating several types of covariate effects. All the model’s coefficients as well as the smoothing parameters associated with the relevant components in the additive predictors are estimated using a carefully structured efficient and stable penalized likelihood algorithm. Some theoretical properties are also discussed. The proposed modeling framework is evaluated in a simulation study and illustrated using a real dataset. The relevant numerical computations can be easily carried out using the freely available GJRM R package. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 886-895 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1593178 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1593178 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:886-895 Template-Type: ReDIF-Article 1.0 Author-Name: Matthias Katzfuss Author-X-Name-First: Matthias Author-X-Name-Last: Katzfuss Author-Name: Jonathan R. Stroud Author-X-Name-First: Jonathan R. Author-X-Name-Last: Stroud Author-Name: Christopher K. Wikle Author-X-Name-First: Christopher K. Author-X-Name-Last: Wikle Title: Ensemble Kalman Methods for High-Dimensional Hierarchical Dynamic Space-Time Models Abstract: We propose a new class of filtering and smoothing methods for inference in high-dimensional, nonlinear, non-Gaussian, spatio-temporal state-space models. The main idea is to combine the ensemble Kalman filter and smoother, developed in the geophysics literature, with state-space algorithms from the statistics literature. Our algorithms address a variety of estimation scenarios, including online and off-line state and parameter estimation. We take a Bayesian perspective, for which the goal is to generate samples from the joint posterior distribution of states and parameters. The key benefit of our approach is the use of ensemble Kalman methods for dimension reduction, which allows inference for high-dimensional state vectors. We compare our methods to existing ones, including ensemble Kalman filters, particle filters, and particle MCMC. Using a real data example of cloud motion and data simulated under a number of nonlinear and non-Gaussian scenarios, we show that our approaches outperform these existing methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 866-885 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1592753 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1592753 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:866-885 Template-Type: ReDIF-Article 1.0 Author-Name: Lin Su Author-X-Name-First: Lin Author-X-Name-Last: Su Author-Name: Wenbin Lu Author-X-Name-First: Wenbin Author-X-Name-Last: Lu Author-Name: Rui Song Author-X-Name-First: Rui Author-X-Name-Last: Song Author-Name: Danyang Huang Author-X-Name-First: Danyang Author-X-Name-Last: Huang Title: Testing and Estimation of Social Network Dependence With Time to Event Data Abstract: Nowadays, events spread rapidly through social networks. We are interested in whether people’s responses to an event are affected by their friends’ characteristics. For example, how soon will a person start playing a game given that his/her friends like it? Studying social network dependence is an emerging research area. In this work, we propose a novel latent spatial autocorrelation Cox model to study social network dependence with time-to-event data. The proposed model introduces a latent indicator to characterize whether a person’s survival time might be affected by his or her friends’ features. We first propose a score-type test for detecting the existence of social network dependence. If it exists, we further develop an EM-type algorithm to estimate the model parameters. The performance of the proposed test and estimators is illustrated by simulation studies and an application to a time-to-event dataset about playing a popular mobile game from one of the largest online social network platforms. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
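[Editorial note] For orientation, the score-type test mentioned in the preceding abstract follows a familiar general pattern. The display below is a generic template in our own notation, not the authors' exact statistic, whose score and information terms are specific to the latent spatial autocorrelation Cox model:

```latex
% Generic score-type test of H_0: no network dependence.
% Illustrative template only; the paper's exact terms are model-specific.
T_n \;=\; U_n(\hat{\theta}_0)^{\top}\, I_n(\hat{\theta}_0)^{-1}\, U_n(\hat{\theta}_0)
\;\xrightarrow{d}\; \chi^2_q \quad \text{under } H_0,
```

where U_n is the score for the dependence parameter evaluated at the null estimate, I_n is the corresponding information matrix, and q is the dimension of the tested parameter. Only the null-model fit is required, which is what makes score-type tests attractive for detection problems of this kind.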
Journal: Journal of the American Statistical Association Pages: 570-582 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1617153 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1617153 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:570-582 Template-Type: ReDIF-Article 1.0 Author-Name: Neal S. Grantham Author-X-Name-First: Neal S. Author-X-Name-Last: Grantham Author-Name: Yawen Guan Author-X-Name-First: Yawen Author-X-Name-Last: Guan Author-Name: Brian J. Reich Author-X-Name-First: Brian J. Author-X-Name-Last: Reich Author-Name: Elizabeth T. Borer Author-X-Name-First: Elizabeth T. Author-X-Name-Last: Borer Author-Name: Kevin Gross Author-X-Name-First: Kevin Author-X-Name-Last: Gross Title: MIMIX: A Bayesian Mixed-Effects Model for Microbiome Data From Designed Experiments Abstract: Recent advances in bioinformatics have made high-throughput microbiome data widely available, and new statistical tools are required to maximize the information gained from these data. For example, analysis of high-dimensional microbiome data from designed experiments remains an open area in microbiome research. Contemporary analyses work on metrics that summarize collective properties of the microbiome, but such reductions preclude inference on the fine-scale effects of environmental stimuli on individual microbial taxa. Other approaches model the proportions or counts of individual taxa as response variables in mixed models, but these methods fail to account for complex correlation patterns among microbial communities. In this article, we propose a novel Bayesian mixed-effects model that exploits cross-taxa correlations within the microbiome, a model we call the microbiome mixed model (MIMIX). MIMIX offers global tests for treatment effects, local tests and estimation of treatment effects on individual taxa, quantification of the relative contribution from heterogeneous sources to microbiome variability, and identification of latent ecological subcommunities in the microbiome. MIMIX is tailored to large microbiome experiments using a combination of Bayesian factor analysis to efficiently represent dependence between taxa and Bayesian variable selection methods to achieve sparsity. We demonstrate the model on a simulation experiment and on a 2 × 2 factorial experiment of the effects of nutrient supplement and herbivore exclusion on the foliar fungal microbiome of Andropogon gerardii, a perennial bunchgrass, as part of the global Nutrient Network research initiative. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 599-609 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1626242 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1626242 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:599-609 Template-Type: ReDIF-Article 1.0 Author-Name: Chenlu Ke Author-X-Name-First: Chenlu Author-X-Name-Last: Ke Author-Name: Xiangrong Yin Author-X-Name-First: Xiangrong Author-X-Name-Last: Yin Title: Expected Conditional Characteristic Function-based Measures for Testing Independence Abstract: We propose a novel class of independence measures for testing independence between two random vectors based on the discrepancy between the conditional and the marginal characteristic functions. The relation between our index and other similar measures is studied, which indicates that they all belong to a large reproducing kernel Hilbert space framework. If one of the variables is categorical, our asymmetric index extends the typical ANOVA to a kernel ANOVA that can test a more general hypothesis of equal distributions among groups. In addition, our index is applicable when both variables are continuous. We develop two empirical estimates and obtain their respective asymptotic distributions. We illustrate the advantages of our approach by numerical studies across a variety of settings, including a real data example. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 985-996 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1604364 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604364 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:985-996 Template-Type: ReDIF-Article 1.0 Author-Name: Ionut Bebu Author-X-Name-First: Ionut Author-X-Name-Last: Bebu Title: Innovative Strategies, Statistical Solutions and Simulations for Modern Clinical Trials Journal: Journal of the American Statistical Association Pages: 1029-1030 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2020.1759987 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1759987 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:1029-1030 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Correction Journal: Journal of the American Statistical Association Pages: 1035-1036 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2020.1724472 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1724472 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:1035-1036 Template-Type: ReDIF-Article 1.0 Author-Name: Emmanuel Candès Author-X-Name-First: Emmanuel Author-X-Name-Last: Candès Author-Name: Chiara Sabatti Author-X-Name-First: Chiara Author-X-Name-Last: Sabatti Title: Discussion of the Paper “Prediction, Estimation, and Attribution” by B. Efron Journal: Journal of the American Statistical Association Pages: 656-658 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2020.1762618 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1762618 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:656-658 Template-Type: ReDIF-Article 1.0 Author-Name: Daniel J. McDonald Author-X-Name-First: Daniel J.
Author-X-Name-Last: McDonald Title: Sufficient Dimension Reduction: Methods and Applications With R Journal: Journal of the American Statistical Association Pages: 1032-1033 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2020.1759990 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1759990 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:1032-1033 Template-Type: ReDIF-Article 1.0 Author-Name: Jean-Noël Bacro Author-X-Name-First: Jean-Noël Author-X-Name-Last: Bacro Author-Name: Carlo Gaetan Author-X-Name-First: Carlo Author-X-Name-Last: Gaetan Author-Name: Thomas Opitz Author-X-Name-First: Thomas Author-X-Name-Last: Opitz Author-Name: Gwladys Toulemonde Author-X-Name-First: Gwladys Author-X-Name-Last: Toulemonde Title: Hierarchical Space-Time Modeling of Asymptotically Independent Exceedances With an Application to Precipitation Data Abstract: The statistical modeling of space-time extremes in environmental applications is key to understanding complex dependence structures in original event data and to generating realistic scenarios for impact models. In this context of high-dimensional data, we propose a novel hierarchical model for high threshold exceedances defined over continuous space and time by embedding a space-time Gamma process convolution for the rate of an exponential variable, leading to asymptotic independence in space and time. Its physically motivated anisotropic dependence structure is based on geometric objects moving through space-time according to a velocity vector. We demonstrate that inference based on weighted pairwise likelihood is fast and accurate. The usefulness of our model is illustrated by an application to hourly precipitation data from a study region in Southern France, where it clearly improves on an alternative censored Gaussian space-time random field model. While classical limit models based on threshold-stability fail to appropriately capture relatively fast joint tail decay rates between asymptotic dependence and classical independence, strong empirical evidence from our application and other recent case studies motivates the use of more realistic asymptotic independence models such as ours. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 555-569 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1617152 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1617152 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:555-569 Template-Type: ReDIF-Article 1.0 Author-Name: Elynn Y. Chen Author-X-Name-First: Elynn Y. Author-X-Name-Last: Chen Author-Name: Ruey S. Tsay Author-X-Name-First: Ruey S. Author-X-Name-Last: Tsay Author-Name: Rong Chen Author-X-Name-First: Rong Author-X-Name-Last: Chen Title: Constrained Factor Models for High-Dimensional Matrix-Variate Time Series Abstract: High-dimensional matrix-variate time series data are becoming widely available in many scientific fields, such as economics, biology, and meteorology. 
To achieve significant dimension reduction while preserving the intrinsic matrix structure and temporal dynamics in such data, Wang, Liu, and Chen proposed a matrix factor model that has been shown to provide effective analysis. In this article, we establish a general framework for incorporating domain and prior knowledge in the matrix factor model through linear constraints. The proposed framework is shown to be useful in achieving parsimonious parameterization, facilitating interpretation of the latent matrix factor, and identifying specific factors of interest. Fully utilizing the prior-knowledge-induced constraints results in more efficient and accurate modeling, inference, and dimension reduction, as well as a clearer and better interpretation of the results. Constrained, multi-term, and partially constrained factor models for matrix-variate time series are developed, with efficient estimation procedures and their asymptotic properties. We show that the convergence rates of the constrained factor loading matrices are much faster than those of the conventional matrix factor analysis under many situations. Simulation studies are carried out to demonstrate the finite-sample performance of the proposed method and its associated asymptotic properties. We illustrate the proposed model with three applications, where the constrained matrix-factor models outperform their unconstrained counterparts in the power of variance explanation under the out-of-sample 10-fold cross-validation setting. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 775-793 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1584899 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1584899 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:775-793 Template-Type: ReDIF-Article 1.0 Author-Name: Abhijoy Saha Author-X-Name-First: Abhijoy Author-X-Name-Last: Saha Author-Name: Karthik Bharath Author-X-Name-First: Karthik Author-X-Name-Last: Bharath Author-Name: Sebastian Kurtek Author-X-Name-First: Sebastian Author-X-Name-Last: Kurtek Title: A Geometric Variational Approach to Bayesian Inference Abstract: We propose a novel Riemannian geometric framework for variational inference in Bayesian models based on the nonparametric Fisher–Rao metric on the manifold of probability density functions. Under the square-root density representation, the manifold can be identified with the positive orthant of the unit hypersphere S∞ in L2, and the Fisher–Rao metric reduces to the standard L2 metric. Exploiting such a Riemannian structure, we formulate the task of approximating the posterior distribution as a variational problem on the hypersphere based on the α-divergence. This provides a tighter lower bound on the marginal distribution when compared to, and a corresponding upper bound unavailable with, approaches based on the Kullback–Leibler divergence. We propose a novel gradient-based algorithm for the variational problem based on Fréchet derivative operators motivated by the geometry of S∞, and examine its properties. Through simulations and real data applications, we demonstrate the utility of the proposed geometric framework and algorithm on several Bayesian models. Supplementary materials for this article are available online.
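[Editorial note] The identification invoked in the preceding abstract can be made concrete. It rests on a standard and well-known fact about the square-root representation (notation ours):

```latex
\psi = \sqrt{p}, \qquad
\|\psi\|_{L^2}^2 = \int \psi(x)^2\,dx = \int p(x)\,dx = 1,
```

so the square root of any density lies on the positive orthant of the unit hypersphere in L2, and the Fisher–Rao geodesic distance between two densities reduces to the great-circle (arc-length) distance between their square roots:

```latex
d_{FR}(p_1, p_2) = \cos^{-1}\!\Big(\int \sqrt{p_1(x)\,p_2(x)}\,dx\Big).
```

This is what turns the variational approximation task into an optimization problem on a sphere.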
Journal: Journal of the American Statistical Association Pages: 822-835 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1585253 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1585253 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:822-835 Template-Type: ReDIF-Article 1.0 Author-Name: D. R. Cox Author-X-Name-First: D. R. Author-X-Name-Last: Cox Title: Discussion of Paper by Brad Efron Journal: Journal of the American Statistical Association Pages: 659-659 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2020.1762451 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1762451 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:659-659 Template-Type: ReDIF-Article 1.0 Author-Name: Karen Kafadar Author-X-Name-First: Karen Author-X-Name-Last: Kafadar Title: Reinforcing the Impact of Statistics on Society Abstract: What does statistics have to offer science and society, in this age of massive data, machine learning algorithms, and multiple online sources of tools for data analysis? I recall a few situations where statistics made a real difference and reinforced the impact of our discipline on society. Sometimes the difference lay in the insightful analysis and inference enabled by ground-breaking methods in our field, such as hypothesis testing, likelihood ratios, Bayesian models, jackknife, and bootstrap. But perhaps more often, the impacts came from thoughtful analyses before data were collected, and the questions that arose after the statistical analysis. The impact of understanding the problem, designing the experiment and data collections, conducting the pilot surveys, and raising important questions is substantial. Through sensible explorations following formal statistical procedures, statisticians have made contributions in many domains. In this presentation, I recall some examples that made a long-lasting impact. Some of them, like randomization in clinical trials, known and familiar to all, are so ingrained in our practice that the role of statistics has been forgotten. Others may be less familiar but nonetheless benefited greatly from the critical input of statisticians. All remind us that our field remains today not only relevant but critical to science and society. Journal: Journal of the American Statistical Association Pages: 491-500 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2020.1761217 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1761217 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:491-500 Template-Type: ReDIF-Article 1.0 Author-Name: Noel Cressie Author-X-Name-First: Noel Author-X-Name-Last: Cressie Title: Comment: When Is It Data Science and When Is It Data Engineering? Journal: Journal of the American Statistical Association Pages: 660-662 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2020.1762619 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1762619 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:660-662 Template-Type: ReDIF-Article 1.0 Author-Name: Chih-Li Sung Author-X-Name-First: Chih-Li Author-X-Name-Last: Sung Author-Name: Wenjia Wang Author-X-Name-First: Wenjia Author-X-Name-Last: Wang Author-Name: Matthew Plumlee Author-X-Name-First: Matthew Author-X-Name-Last: Plumlee Author-Name: Benjamin Haaland Author-X-Name-First: Benjamin Author-X-Name-Last: Haaland Title: Multiresolution Functional ANOVA for Large-Scale, Many-Input Computer Experiments Abstract: The Gaussian process is a standard tool for building emulators for both deterministic and stochastic computer experiments. However, application of Gaussian process models is greatly limited in practice, particularly for large-scale and many-input computer experiments that have become typical. We propose a multiresolution functional ANOVA (MRFA) model as a computationally feasible emulation alternative. More generally, this model can be used for large-scale and many-input nonlinear regression problems. An overlapping group lasso approach is used for estimation, ensuring computational feasibility in a large-scale and many-input setting. New results on consistency and inference for the (potentially overlapping) group lasso in a high-dimensional setting are developed and applied to the proposed MRFA model. Importantly, these results allow us to quantify the uncertainty in our predictions. Numerical examples demonstrate that the proposed model enjoys marked computational advantages. Data capabilities, in terms of both sample size and dimension, meet or exceed those of the best available emulation tools, while emulation accuracy is matched or exceeded. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 908-919 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1595630 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1595630 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:908-919 Template-Type: ReDIF-Article 1.0 Author-Name: Guan Yu Author-X-Name-First: Guan Author-X-Name-Last: Yu Author-Name: Liang Yin Author-X-Name-First: Liang Author-X-Name-Last: Yin Author-Name: Shu Lu Author-X-Name-First: Shu Author-X-Name-Last: Lu Author-Name: Yufeng Liu Author-X-Name-First: Yufeng Author-X-Name-Last: Liu Title: Confidence Intervals for Sparse Penalized Regression With Random Designs Abstract: With the abundance of large data, sparse penalized regression techniques are commonly used in data analysis due to the advantage of simultaneous variable selection and estimation. A number of convex as well as nonconvex penalties have been proposed in the literature to achieve sparse estimates. Despite intense work in this area, how to perform valid inference for sparse penalized regression with a general penalty remains an active research problem. In this article, by making use of state-of-the-art optimization tools in stochastic variational inequality theory, we propose a unified framework to construct confidence intervals for sparse penalized regression with a wide range of penalties, including convex and nonconvex penalties. We study the inference for parameters under the population version of the penalized regression as well as parameters of the underlying linear model. Theoretical convergence properties of the proposed method are obtained.
Several simulated and real data examples are presented to demonstrate the validity and effectiveness of the proposed inference procedure. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 794-809 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1585251 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1585251 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:794-809 Template-Type: ReDIF-Article 1.0 Author-Name: Hamid Javadi Author-X-Name-First: Hamid Author-X-Name-Last: Javadi Author-Name: Andrea Montanari Author-X-Name-First: Andrea Author-X-Name-Last: Montanari Title: Nonnegative Matrix Factorization Via Archetypal Analysis Abstract: Given a collection of data points, nonnegative matrix factorization (NMF) suggests expressing them as convex combinations of a small set of “archetypes” with nonnegative entries. This decomposition is unique only if the true archetypes are nonnegative and sufficiently sparse (or the weights are sufficiently sparse), a regime that is captured by the separability condition and its generalizations. In this article, we study an approach to NMF that can be traced back to the work of Cutler and Breiman [(1994), “Archetypal Analysis,” Technometrics, 36, 338–347] and does not require the data to be separable, while providing a generally unique decomposition. We optimize a trade-off between two objectives: we minimize the distance of the data points from the convex envelope of the archetypes (which can be interpreted as an empirical risk), while also minimizing the distance of the archetypes from the convex envelope of the data (which can be interpreted as a data-dependent regularization). The archetypal analysis method of Cutler and Breiman is recovered as the limiting case in which the last term is given infinite weight. We introduce a “uniqueness condition” on the data which is necessary for identifiability. We prove that, under uniqueness (plus additional regularity conditions on the geometry of the archetypes), our estimator is robust. While our approach requires solving a nonconvex optimization problem, we find that standard optimization methods succeed in finding good solutions for both real and synthetic data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 896-907 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2019.1594832 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1594832 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:896-907 Template-Type: ReDIF-Article 1.0 Author-Name: Daniel J. Luckett Author-X-Name-First: Daniel J. Author-X-Name-Last: Luckett Author-Name: Eric B. Laber Author-X-Name-First: Eric B. Author-X-Name-Last: Laber Author-Name: Anna R. Kahkoska Author-X-Name-First: Anna R. Author-X-Name-Last: Kahkoska Author-Name: David M. Maahs Author-X-Name-First: David M. Author-X-Name-Last: Maahs Author-Name: Elizabeth Mayer-Davis Author-X-Name-First: Elizabeth Author-X-Name-Last: Mayer-Davis Author-Name: Michael R. Kosorok Author-X-Name-First: Michael R.
Author-X-Name-Last: Kosorok Title: Estimating Dynamic Treatment Regimes in Mobile Health Using V-Learning Abstract: The vision for precision medicine is to use individual patient characteristics to inform a personalized treatment plan that leads to the best possible healthcare for each patient. Mobile technologies have an important role to play in this vision as they offer a means to monitor a patient’s health status in real-time and subsequently to deliver interventions if, when, and in the dose that they are needed. Dynamic treatment regimes formalize individualized treatment plans as sequences of decision rules, one per stage of clinical intervention, that map current patient information to a recommended treatment. However, most existing methods for estimating optimal dynamic treatment regimes are designed for a small number of fixed decision points occurring on a coarse time-scale. We propose a new reinforcement learning method for estimating an optimal treatment regime that is applicable to data collected using mobile technologies in an outpatient setting. The proposed method accommodates an indefinite time horizon and minute-by-minute decision making that are common in mobile health applications. We show that the proposed estimators are consistent and asymptotically normal under mild conditions. The proposed methods are applied to estimate an optimal dynamic treatment regime for controlling blood glucose levels in patients with type 1 diabetes. Journal: Journal of the American Statistical Association Pages: 692-706 Issue: 530 Volume: 115 Year: 2020 Month: 4 X-DOI: 10.1080/01621459.2018.1537919 File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537919 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:692-706 Template-Type: ReDIF-Article 1.0 Author-Name: Yixuan Qiu Author-X-Name-First: Yixuan Author-X-Name-Last: Qiu Author-Name: Xiao Wang Author-X-Name-First: Xiao Author-X-Name-Last: Wang Title: ALMOND: Adaptive Latent Modeling and Optimization via Neural Networks and Langevin Diffusion Abstract: Latent variable models cover a broad range of statistical and machine learning models, such as Bayesian models, linear mixed models, and Gaussian mixture models. Existing methods often suffer from two major challenges in practice: (a) a proper latent variable distribution is difficult to specify; (b) exact likelihood inference is formidable due to intractable computation. We propose a novel framework for the inference of latent variable models that overcomes these two limitations. This new framework allows for a fully data-driven latent variable distribution via deep neural networks, and the proposed stochastic gradient method, combined with the Langevin algorithm, is efficient and suitable for complex models and big data. We provide theoretical results for the Langevin algorithm, and establish the convergence analysis of the optimization method. This framework has demonstrated superior practical performance through simulation studies and a real data analysis. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1224-1236 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2019.1691563 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1691563 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1224-1236 Template-Type: ReDIF-Article 1.0 Author-Name: Qiang Sun Author-X-Name-First: Qiang Author-X-Name-Last: Sun Author-Name: Heping Zhang Author-X-Name-First: Heping Author-X-Name-Last: Zhang Title: Targeted Inference Involving High-Dimensional Data Using Nuisance Penalized Regression Abstract: Analysis of high-dimensional data has received considerable and increasing attention in statistics. In practice, we may not be interested in every variable that is observed. Instead, often some of the variables are of particular interest, and the remaining variables are nuisances. To this end, we propose the nuisance penalized regression, which does not penalize the parameters of interest. When the coherence between interest parameters and nuisance parameters is negligible, we show that the resulting estimator can be directly used for inference without any correction. When the coherence is not negligible, we propose an iterative procedure to further refine the estimate of interest parameters, based on which we propose a modified profile-likelihood-based statistic for hypothesis testing. The utilities of our general results are demonstrated in three specific examples. Numerical studies lend further support to our method. Journal: Journal of the American Statistical Association Pages: 1472-1486 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1737079 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1737079 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1472-1486 Template-Type: ReDIF-Article 1.0 Author-Name: Shulei Wang Author-X-Name-First: Shulei Author-X-Name-Last: Wang Author-Name: T. Tony Cai Author-X-Name-First: T. Tony Author-X-Name-Last: Cai Author-Name: Hongzhe Li Author-X-Name-First: Hongzhe Author-X-Name-Last: Li Title: Optimal Estimation of Wasserstein Distance on a Tree With an Application to Microbiome Studies Abstract: The weighted UniFrac distance, a plug-in estimator of the Wasserstein distance of read counts on a tree, has been widely used to measure the microbial community difference in microbiome studies. Our investigation, however, shows that such a plug-in estimator, although intuitive and commonly used in practice, suffers from potential bias. Motivated by this finding, we study the problem of optimal estimation of the Wasserstein distance between two distributions on a tree from the sampled data in the high-dimensional setting. The minimax rate of convergence is established. To overcome the bias problem, we introduce a new estimator, referred to as the moment-screening estimator on a tree (MET), by using implicit best polynomial approximation that incorporates the tree structure. The new estimator is computationally efficient and is shown to be minimax rate-optimal. Numerical studies using both simulated and real biological datasets demonstrate the practical merits of MET, including reduced biases and statistically more significant differences in the microbiome between the inactive Crohn’s disease patients and the normal controls. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1237-1253 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2019.1699422 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1699422 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1237-1253 Template-Type: ReDIF-Article 1.0 Author-Name: Laurens de Haan Author-X-Name-First: Laurens Author-X-Name-Last: de Haan Author-Name: Chen Zhou Author-X-Name-First: Chen Author-X-Name-Last: Zhou Title: Trends in Extreme Value Indices Abstract: We consider extreme value analysis for independent but nonidentically distributed observations. In particular, the observations do not share the same extreme value index. Assuming continuously changing extreme value indices, we provide a nonparametric estimate for the functional extreme value index. Besides estimating the extreme value index locally, we also provide a global estimator for the trend and its joint asymptotic theory. The asymptotic theory for the global estimator can be used for testing a prespecified parametric trend in the extreme value indices. In particular, it can be applied to test whether the extreme value index remains at a constant level across all observations. Journal: Journal of the American Statistical Association Pages: 1265-1279 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2019.1705307 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1705307 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1265-1279 Template-Type: ReDIF-Article 1.0 Author-Name: Stephen Bates Author-X-Name-First: Stephen Author-X-Name-Last: Bates Author-Name: Emmanuel Candès Author-X-Name-First: Emmanuel Author-X-Name-Last: Candès Author-Name: Lucas Janson Author-X-Name-First: Lucas Author-X-Name-Last: Janson Author-Name: Wenshuo Wang Author-X-Name-First: Wenshuo Author-X-Name-Last: Wang Title: Metropolized Knockoff Sampling Abstract: Model-X knockoffs is a wrapper that transforms essentially any feature importance measure into a variable selection algorithm, which discovers true effects while rigorously controlling the expected fraction of false positives. A frequently discussed challenge in applying this method is constructing knockoff variables, which are synthetic variables obeying a crucial exchangeability property with the explanatory variables under study. This article introduces techniques for knockoff generation in great generality: we provide a sequential characterization of all possible knockoff distributions, which leads to a Metropolis–Hastings formulation of an exact knockoff sampler. We further show how to use conditional independence structure to speed up computations. Combining these two threads, we introduce an explicit set of sequential algorithms and empirically demonstrate their effectiveness. Our theoretical analysis proves that our algorithms achieve near-optimal computational complexity in certain cases. The techniques we develop are sufficiently rich to enable knockoff sampling in challenging models, including cases where the covariates are continuous and heavy-tailed, and follow a graphical model such as the Ising model. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1413-1427 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1729163 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1729163 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1413-1427 Template-Type: ReDIF-Article 1.0 Author-Name: Pierre E. Jacob Author-X-Name-First: Pierre E.
Author-X-Name-Last: Jacob Author-Name: Ruobin Gong Author-X-Name-First: Ruobin Author-X-Name-Last: Gong Author-Name: Paul T. Edlefsen Author-X-Name-First: Paul T. Author-X-Name-Last: Edlefsen Author-Name: Arthur P. Dempster Author-X-Name-First: Arthur P. Author-X-Name-Last: Dempster Title: A Gibbs Sampler for a Class of Random Convex Polytopes Abstract: We present a Gibbs sampler for the Dempster–Shafer (DS) approach to statistical inference for categorical distributions. The DS framework extends the Bayesian approach, allows, in particular, the use of partial prior information, and yields three-valued uncertainty assessments representing probabilities “for,” “against,” and “don’t know” about formal assertions of interest. The proposed algorithm targets the distribution of a class of random convex polytopes which encapsulate the DS inference. The sampler relies on an equivalence between the iterative constraints of the vertex configuration and the nonnegativity of cycles in a fully connected directed graph. Illustrations include the testing of independence in 2 × 2 contingency tables and parameter estimation of the linkage model. Journal: Journal of the American Statistical Association Pages: 1181-1192 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2021.1881523 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1881523 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1181-1192 Template-Type: ReDIF-Article 1.0 Author-Name: Trevor Harris Author-X-Name-First: Trevor Author-X-Name-Last: Harris Author-Name: Bo Li Author-X-Name-First: Bo Author-X-Name-Last: Li Author-Name: Nathan J. Steiger Author-X-Name-First: Nathan J. Author-X-Name-Last: Steiger Author-Name: Jason E. Smerdon Author-X-Name-First: Jason E. Author-X-Name-Last: Smerdon Author-Name: Naveen Narisetty Author-X-Name-First: Naveen Author-X-Name-Last: Narisetty Author-Name: J. Derek Tucker Author-X-Name-First: J. Derek Author-X-Name-Last: Tucker Title: Evaluating Proxy Influence in Assimilated Paleoclimate Reconstructions—Testing the Exchangeability of Two Ensembles of Spatial Processes Abstract: Climate field reconstructions (CFRs) attempt to estimate spatiotemporal fields of climate variables in the past using climate proxies such as tree rings, ice cores, and corals. Data assimilation (DA) methods are a recent and promising means of deriving CFRs that optimally fuse climate proxies with climate model output. Despite the growing application of DA-based CFRs, little is understood about how much the assimilated proxies change the statistical properties of the climate model data. To address this question, we propose a robust and computationally efficient method, based on functional data depth, to evaluate differences in the distributions of two spatiotemporal processes. We apply our test to study global and regional proxy influence in DA-based CFRs by comparing the background and analysis states, which are treated as two samples of spatiotemporal fields. We find that the analysis states are significantly altered from the climate-model-based background states due to the assimilation of proxies. Moreover, the difference between the analysis and background states increases with the number of proxies, even in regions far beyond proxy collection sites. Our approach allows us to characterize the added value of proxies, indicating where and when the analysis states are distinct from the background states.
Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1100-1113 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1799810 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1799810 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1100-1113 Template-Type: ReDIF-Article 1.0 Author-Name: Abhra Sarkar Author-X-Name-First: Abhra Author-X-Name-Last: Sarkar Author-Name: Debdeep Pati Author-X-Name-First: Debdeep Author-X-Name-Last: Pati Author-Name: Bani K. Mallick Author-X-Name-First: Bani K. Author-X-Name-Last: Mallick Author-Name: Raymond J. Carroll Author-X-Name-First: Raymond J. Author-X-Name-Last: Carroll Title: Bayesian Copula Density Deconvolution for Zero-Inflated Data in Nutritional Epidemiology Abstract: Estimating the marginal and joint densities of the long-term average intakes of different dietary components is an important problem in nutritional epidemiology. Since these variables cannot be directly measured, data are usually collected in the form of 24-hr recalls of the intakes, which show marked patterns of conditional heteroscedasticity. Significantly compounding the challenges, the recalls for episodically consumed dietary components also include exact zeros. The problem of estimating the density of the latent long-term intakes from their observed measurement-error-contaminated proxies is then a problem of deconvolution of densities with zero-inflated data. We propose a Bayesian semiparametric solution to the problem, building on a novel hierarchical latent variable framework that translates the problem to one involving continuous surrogates only. Crucial to accommodating important aspects of the problem, we then design a copula-based approach to model the involved joint distributions, adopting different modeling strategies for the marginals of the different dietary components. We design efficient Markov chain Monte Carlo algorithms for posterior inference and illustrate the efficacy of the proposed method through simulation experiments. Applied to our motivating nutritional epidemiology problems, our method provides more realistic estimates of the consumption patterns of episodically consumed dietary components than other approaches. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1075-1087 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1782220 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1782220 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1075-1087 Template-Type: ReDIF-Article 1.0 Author-Name: Richard A. Davis Author-X-Name-First: Richard A. Author-X-Name-Last: Davis Author-Name: Konstantinos Fokianos Author-X-Name-First: Konstantinos Author-X-Name-Last: Fokianos Author-Name: Scott H. Holan Author-X-Name-First: Scott H.
Author-X-Name-Last: Holan Author-Name: Harry Joe Author-X-Name-First: Harry Author-X-Name-Last: Joe Author-Name: James Livsey Author-X-Name-First: James Author-X-Name-Last: Livsey Author-Name: Robert Lund Author-X-Name-First: Robert Author-X-Name-Last: Lund Author-Name: Vladas Pipiras Author-X-Name-First: Vladas Author-X-Name-Last: Pipiras Author-Name: Nalini Ravishanker Author-X-Name-First: Nalini Author-X-Name-Last: Ravishanker Title: Count Time Series: A Methodological Review Abstract: A growing interest in non-Gaussian time series, particularly in series comprised of nonnegative integers (counts), is taking place in today’s statistics literature. Count series naturally arise in fields such as agriculture, economics, epidemiology, finance, geology, meteorology, and sports. Unlike stationary Gaussian series, where autoregressive moving averages are the primary modeling vehicle, no single class of models dominates the count landscape. As such, the literature has evolved in a somewhat ad hoc manner, with different model classes being developed to tackle specific situations. This article is an attempt to summarize the current state of count time series modeling. The article first reviews models having prescribed marginal distributions, including some recent developments. This is followed by a discussion of state-space approaches. Multivariate extensions of the methods are then studied, and Bayesian approaches to the problem are considered. The intent is to inform researchers and practitioners about the various types of count time series models arising in the modern literature. While estimation issues are not pursued in detail, reference to this literature is made. Journal: Journal of the American Statistical Association Pages: 1533-1547 Issue: 535 Volume: 116 Year: 2021 Month: 5 X-DOI: 10.1080/01621459.2021.1904957 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1904957 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1533-1547 Template-Type: ReDIF-Article 1.0 Author-Name: Thomas Kuenzer Author-X-Name-First: Thomas Author-X-Name-Last: Kuenzer Author-Name: Siegfried Hörmann Author-X-Name-First: Siegfried Author-X-Name-Last: Hörmann Author-Name: Piotr Kokoszka Author-X-Name-First: Piotr Author-X-Name-Last: Kokoszka Title: Principal Component Analysis of Spatially Indexed Functions Abstract: We develop an expansion, similar in some respects to the Karhunen–Loève expansion, but which is more suitable for functional data indexed by spatial locations on a grid. Unlike the traditional Karhunen–Loève expansion, it takes into account the spatial dependence between the functions. By doing so, it provides a more efficient dimension reduction tool, both theoretically and in finite samples, for functional data with moderate spatial dependence. For such data, it also possesses other theoretical and practical advantages over the currently used approach. The article develops complete asymptotic theory and estimation methodology. The performance of the method is examined by a simulation study and data analysis. The new tools are implemented in an R package. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1444-1456 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1732395 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1732395 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1444-1456 Template-Type: ReDIF-Article 1.0 Author-Name: Matteo Fasiolo Author-X-Name-First: Matteo Author-X-Name-Last: Fasiolo Author-Name: Simon N. Wood Author-X-Name-First: Simon N. Author-X-Name-Last: Wood Author-Name: Margaux Zaffran Author-X-Name-First: Margaux Author-X-Name-Last: Zaffran Author-Name: Raphaël Nedellec Author-X-Name-First: Raphaël Author-X-Name-Last: Nedellec Author-Name: Yannig Goude Author-X-Name-First: Yannig Author-X-Name-Last: Goude Title: Fast Calibrated Additive Quantile Regression Abstract: We propose a novel framework for fitting additive quantile regression models, which provides well-calibrated inference about the conditional quantiles and fast automatic estimation of the smoothing parameters, for model structures as diverse as those usable with distributional generalized additive models, while maintaining equivalent numerical efficiency and stability. The proposed methods are at once statistically rigorous and computationally efficient, because they apply the general belief updating framework of Bissiri, Holmes, and Walker to loss-based inference, while computing by adapting the stable fitting methods of Wood, Pya, and Säfken. We show how the pinball loss is statistically suboptimal relative to a novel smooth generalization, which also gives access to fast estimation methods. Further, we provide a novel calibration method for efficiently selecting the “learning rate” balancing the loss with the smoothing priors during inference, thereby obtaining reliable quantile uncertainty estimates. Our work was motivated by a probabilistic electricity load forecasting application, used here to demonstrate the proposed approach. The methods described here are implemented by the qgam R package, available on the Comprehensive R Archive Network (CRAN). Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1402-1412 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1725521 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1725521 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1402-1412 Template-Type: ReDIF-Article 1.0 Author-Name: Jonathan P Williams Author-X-Name-First: Jonathan P Author-X-Name-Last: Williams Title: Discussion of “A Gibbs Sampler for a Class of Random Convex Polytopes” Journal: Journal of the American Statistical Association Pages: 1198-1200 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2021.1946405 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1946405 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1198-1200 Template-Type: ReDIF-Article 1.0 Author-Name: Jianwei Hu Author-X-Name-First: Jianwei Author-X-Name-Last: Hu Author-Name: Jingfei Zhang Author-X-Name-First: Jingfei Author-X-Name-Last: Zhang Author-Name: Hong Qin Author-X-Name-First: Hong Author-X-Name-Last: Qin Author-Name: Ting Yan Author-X-Name-First: Ting Author-X-Name-Last: Yan Author-Name: Ji Zhu Author-X-Name-First: Ji Author-X-Name-Last: Zhu Title: Using Maximum Entry-Wise Deviation to Test the Goodness of Fit for Stochastic Block Models Abstract: The stochastic block model is widely used for detecting community structures in network data.
How to test the goodness of fit of the model is one of the fundamental problems and has attracted growing interest in recent years. In this article, we propose a novel goodness-of-fit test based on the maximum entry of the centered and rescaled adjacency matrix for the stochastic block model. One noticeable advantage of the proposed test is that the number of communities can be allowed to grow linearly with the number of nodes, up to a logarithmic factor. We prove that the null distribution of the test statistic converges in distribution to a Gumbel distribution, and we show that both the number of communities and the membership vector can be tested via the proposed method. Furthermore, we show that the proposed test has an asymptotic power guarantee against a class of alternatives. We also demonstrate that the proposed method can be extended to the degree-corrected stochastic block model. Both simulation studies and real-world data examples indicate that the proposed method works well. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1373-1382 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1722676 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1722676 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1373-1382 Template-Type: ReDIF-Article 1.0 Author-Name: Bingying Xie Author-X-Name-First: Bingying Author-X-Name-Last: Xie Author-Name: Jun Shao Author-X-Name-First: Jun Author-X-Name-Last: Shao Title: Nonparametric Estimation of Conditional Expectation with Auxiliary Information and Dimension Reduction Abstract: Nonparametric estimation of the conditional expectation E(Y|U) of an outcome Y given a covariate vector U is of primary importance in many statistical applications such as prediction and personalized medicine. In some problems, there is an additional auxiliary variable Z in the training dataset used to construct estimators, but Z is not available for future prediction or selecting patient treatment in personalized medicine. For example, in the training dataset longitudinal outcomes are observed, but only the last outcome Y is of concern in the future prediction or analysis. The longitudinal outcomes other than the last one then constitute the variable Z, which is observed and related to both Y and U. Previous work on how to make use of Z in the estimation of E(Y|U) mainly focused on using Z in the construction of a linear function of U to reduce covariate dimension for better estimation. Using E(Y|U) = E{E(Y|U,Z)|U}, we propose a two-step estimation of the inner and outer expectations, respectively, with sufficient dimension reduction for kernel estimation in both steps. The information from Z is utilized not only in dimension reduction, but also directly in the estimation. Because dimension reduction can be carried out in different ways, we construct two estimators that may improve on the estimator that does not use Z. The improvements are shown in the convergence rates of the estimators as the sample size increases to infinity, as well as in finite-sample simulation performance. A real data analysis about the selection of mammography intervention is presented for illustration.
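[Editorial note] The tower-property construction in the preceding abstract is easy to sketch with plain kernel smoothing in both steps. The Python sketch below is purely illustrative: it omits the sufficient dimension reduction that the paper applies in each step, and the bandwidths h1, h2 are hypothetical placeholders.

```python
import numpy as np

def nw(x_train, y_train, x_query, h):
    """Nadaraya-Watson regression with a Gaussian kernel.
    x_train: (n, d), y_train: (n,), x_query: (m, d)."""
    d2 = ((x_query[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=-1)
    w = np.exp(-d2 / (2.0 * h ** 2))
    return (w * y_train).sum(axis=1) / w.sum(axis=1)

def two_step(U, Z, Y, u_query, h1=0.5, h2=0.5):
    """Estimate E(Y|U) as E{E(Y|U,Z)|U}: smooth Y on (U, Z) first, then
    smooth the fitted values on U alone. Illustrative sketch only; the
    paper additionally uses sufficient dimension reduction in each step."""
    UZ = np.hstack([U, Z])            # step-1 covariates, shape (n, dU + dZ)
    inner = nw(UZ, Y, UZ, h1)         # Ehat(Y | U_i, Z_i) at training points
    return nw(U, inner, u_query, h2)  # step 2: average Z out by smoothing over U

# Toy usage: Z is informative about Y beyond U but unavailable at prediction time.
rng = np.random.default_rng(0)
U = rng.normal(size=(500, 1))
Z = U + 0.5 * rng.normal(size=(500, 1))
Y = (U + Z).ravel() + 0.3 * rng.normal(size=500)
print(two_step(U, Z, Y, np.linspace(-2, 2, 5)[:, None]))
```

The point of the second smoothing step is that the well-estimated inner regression E(Y|U,Z) carries the extra information in Z, which the outer step then projects onto U, the only covariate available at prediction time.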
Journal: Journal of the American Statistical Association Pages: 1346-1357 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1713793 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1713793 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1346-1357 Template-Type: ReDIF-Article 1.0 Author-Name: Earl Lawrence Author-X-Name-First: Earl Author-X-Name-Last: Lawrence Author-Name: Scott Vander Wiel Author-X-Name-First: Scott Author-X-Name-Last: Vander Wiel Title: Comment on “A Gibbs Sampler for a Class of Random Convex Polytopes” Journal: Journal of the American Statistical Association Pages: 1201-1203 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2021.1947305 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1947305 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1201-1203 Template-Type: ReDIF-Article 1.0 Author-Name: Ian Laga Author-X-Name-First: Ian Author-X-Name-Last: Laga Author-Name: Le Bao Author-X-Name-First: Le Author-X-Name-Last: Bao Author-Name: Xiaoyue Niu Author-X-Name-First: Xiaoyue Author-X-Name-Last: Niu Title: Thirty Years of The Network Scale-up Method Abstract: Estimating the size of hard-to-reach populations is an important problem for many fields. The network scale-up method (NSUM) is a relatively new approach to estimate the size of these hard-to-reach populations by asking respondents the question, “How many X’s do you know,” where X is the population of interest (e.g., “How many female sex workers do you know?”). The answers to these questions form aggregated relational data (ARD). The NSUM has been used to estimate the size of a variety of subpopulations, including female sex workers, drug users, and even children who have been hospitalized for choking. Within the network scale-up methodology, there are a multitude of estimators for the size of the hidden population, including direct estimators, maximum likelihood estimators, and Bayesian estimators. In this article, we first provide an in-depth analysis of ARD properties and the techniques to collect the data. Then, we comprehensively review different estimation methods in terms of the assumptions behind each model, the relationships between the estimators, and the practical considerations of implementing the methods. We apply many of the models discussed in the review to one canonical dataset and compare their performance and unique features, presented in the supplementary materials. Finally, we provide a summary of the dominant methods and an extensive list of the applications, and discuss the open problems and potential research directions in this area. Journal: Journal of the American Statistical Association Pages: 1548-1559 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2021.1935267 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1935267 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1548-1559 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Correction Journal: Journal of the American Statistical Association Pages: 1560-1560 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2021.1957322 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1957322 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1560-1560 Template-Type: ReDIF-Article 1.0 Author-Name: Giorgio Paulon Author-X-Name-First: Giorgio Author-X-Name-Last: Paulon Author-Name: Fernando Llanos Author-X-Name-First: Fernando Author-X-Name-Last: Llanos Author-Name: Bharath Chandrasekaran Author-X-Name-First: Bharath Author-X-Name-Last: Chandrasekaran Author-Name: Abhra Sarkar Author-X-Name-First: Abhra Author-X-Name-Last: Sarkar Title: Bayesian Semiparametric Longitudinal Drift-Diffusion Mixed Models for Tone Learning in Adults Abstract: Understanding how adult humans learn nonnative speech categories, such as tone information, has shed new light on the mechanisms underlying experience-dependent brain plasticity. Scientists have traditionally examined these questions using longitudinal learning experiments under a multi-category decision making paradigm. Drift-diffusion processes are popular in such contexts for their ability to mimic underlying neural mechanisms. Motivated by these problems, we develop a novel Bayesian semiparametric inverse Gaussian drift-diffusion mixed model for multi-alternative decision making in longitudinal settings. We design a Markov chain Monte Carlo algorithm for posterior computation. We evaluate the method’s empirical performance through synthetic experiments. Applied to our motivating longitudinal tone learning study, the method provides novel insights into how the biologically interpretable model parameters evolve with learning, differ between input-response tone combinations, and differ between well and poorly performing adults. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1114-1127 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1801448 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1801448 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1114-1127 Template-Type: ReDIF-Article 1.0 Author-Name: Persi Diaconis Author-X-Name-First: Persi Author-X-Name-Last: Diaconis Author-Name: Guanyang Wang Author-X-Name-First: Guanyang Author-X-Name-Last: Wang Title: Discussion of “A Gibbs Sampler for a Class of Random Convex Polytopes” Journal: Journal of the American Statistical Association Pages: 1193-1195 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2021.1950000 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1950000 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1193-1195 Template-Type: ReDIF-Article 1.0 Author-Name: Naim U. Rashid Author-X-Name-First: Naim U. Author-X-Name-Last: Rashid Author-Name: Daniel J. Luckett Author-X-Name-First: Daniel J. Author-X-Name-Last: Luckett Author-Name: Jingxiang Chen Author-X-Name-First: Jingxiang Author-X-Name-Last: Chen Author-Name: Michael T. Lawson Author-X-Name-First: Michael T. Author-X-Name-Last: Lawson Author-Name: Longshaokan Wang Author-X-Name-First: Longshaokan Author-X-Name-Last: Wang Author-Name: Yunshu Zhang Author-X-Name-First: Yunshu Author-X-Name-Last: Zhang Author-Name: Eric B. Laber Author-X-Name-First: Eric B.
Author-X-Name-Last: Laber Author-Name: Yufeng Liu Author-X-Name-First: Yufeng Author-X-Name-Last: Liu Author-Name: Jen Jen Yeh Author-X-Name-First: Jen Jen Author-X-Name-Last: Yeh Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: Michael R. Kosorok Author-X-Name-First: Michael R. Author-X-Name-Last: Kosorok Title: High-Dimensional Precision Medicine From Patient-Derived Xenografts Abstract: The complexity of human cancer often results in significant heterogeneity in response to treatment. Precision medicine offers the potential to improve patient outcomes by leveraging this heterogeneity. Individualized treatment rules (ITRs) formalize precision medicine as maps from the patient covariate space into the space of allowable treatments. The optimal ITR is that which maximizes the mean of a clinical outcome in a population of interest. Patient-derived xenograft (PDX) studies permit the evaluation of multiple treatments within a single tumor, and thus are ideally suited for estimating optimal ITRs. PDX data are characterized by correlated outcomes, a high-dimensional feature space, and a large number of treatments. Here we explore machine learning methods for estimating optimal ITRs from PDX data. We analyze data from a large PDX study to identify biomarkers that are informative for developing personalized treatment recommendations in multiple cancers. We estimate optimal ITRs using regression-based (Q-learning) and direct-search methods (outcome weighted learning). Finally, we implement a superlearner approach to combine multiple estimated ITRs and show that the resulting ITR performs better than any of the input ITRs, mitigating uncertainty regarding user choice. Our results indicate that PDX data are a valuable resource for developing individualized treatment strategies in oncology. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1140-1154 Issue: 535 Volume: 116 Year: 2020 Month: 11 X-DOI: 10.1080/01621459.2020.1828091 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1828091 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2020:i:535:p:1140-1154 Template-Type: ReDIF-Article 1.0 Author-Name: Xinzhou Guo Author-X-Name-First: Xinzhou Author-X-Name-Last: Guo Author-Name: Xuming He Author-X-Name-First: Xuming Author-X-Name-Last: He Title: Inference on Selected Subgroups in Clinical Trials Abstract: When existing clinical trial data suggest a promising subgroup, we must address the question of how good the selected subgroup really is. The usual statistical inference applied to the selected subgroup, assuming that the subgroup is chosen independent of the data, may lead to an overly optimistic evaluation of the selected subgroup. In this article, we address the issue of selection bias and develop a de-biasing bootstrap inference procedure for the best selected subgroup effect. The proposed inference procedure is model-free, easy to compute, and asymptotically sharp. We demonstrate the merit of our proposed method by reanalyzing the MONET1 trial and show that how the subgroup is selected post hoc should play an important role in any statistical analysis. Supplementary materials for this article are available online. 
Journal: Journal of the American Statistical Association Pages: 1498-1506 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1740096 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1740096 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1498-1506 Template-Type: ReDIF-Article 1.0 Author-Name: Glenn Shafer Author-X-Name-First: Glenn Author-X-Name-Last: Shafer Title: Comment on “A Gibbs Sampler for a Class of Random Convex Polytopes,” by Pierre E. Jacob, Ruobin Gong, Paul T. Edlefsen, and Arthur P. Dempster Journal: Journal of the American Statistical Association Pages: 1196-1197 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2021.1950001 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1950001 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1196-1197 Template-Type: ReDIF-Article 1.0 Author-Name: Claudio Heinrich Author-X-Name-First: Claudio Author-X-Name-Last: Heinrich Author-Name: Kristoffer H. Hellton Author-X-Name-First: Kristoffer H. Author-X-Name-Last: Hellton Author-Name: Alex Lenkoski Author-X-Name-First: Alex Author-X-Name-Last: Lenkoski Author-Name: Thordis L. Thorarinsdottir Author-X-Name-First: Thordis L. Author-X-Name-Last: Thorarinsdottir Title: Multivariate Postprocessing Methods for High-Dimensional Seasonal Weather Forecasts Abstract: Seasonal weather forecasts are crucial for long-term planning in many practical situations and skillful forecasts may have substantial economic and humanitarian implications. Current seasonal forecasting models require statistical postprocessing of the output to correct systematic biases and unrealistic uncertainty assessments. We propose a multivariate postprocessing approach using covariance tapering, combined with a dimension reduction step based on principal component analysis for efficient computation. Our proposed technique can correctly and efficiently handle nonstationary, non-isotropic, and negatively correlated spatial error patterns, and is applicable on a global scale. Further, a moving average approach to marginal postprocessing is shown to flexibly handle trends in biases caused by global warming, as well as short training periods. In an application to global sea surface temperature forecasts issued by the Norwegian climate prediction model, our proposed methodology is shown to outperform known reference methods. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1048-1059 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1769634 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1769634 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1048-1059 Template-Type: ReDIF-Article 1.0 Author-Name: Ben Dai Author-X-Name-First: Ben Author-X-Name-Last: Dai Author-Name: Xiaotong Shen Author-X-Name-First: Xiaotong Author-X-Name-Last: Shen Author-Name: Junhui Wang Author-X-Name-First: Junhui Author-X-Name-Last: Wang Author-Name: Annie Qu Author-X-Name-First: Annie Author-X-Name-Last: Qu Title: Scalable Collaborative Ranking for Personalized Prediction Abstract: Personalized prediction presents an important yet challenging task, which predicts user-specific preferences on a large number of items given limited information. It is often cast as a recommender system focusing on ordinal or continuous ratings, as in collaborative filtering and content-based filtering. In this article, we propose a new collaborative ranking system to predict most-preferred items for each user given search queries. In particular, we propose a ψ-ranker based on ranking functions incorporating information on users, items, and search queries through latent factor models. Moreover, we show that the proposed nonconvex surrogate pairwise ψ-loss performs well under four popular bipartite ranking losses, namely the sum loss, pairwise zero-one loss, discounted cumulative gain, and mean average precision. We develop a parallel computing strategy to optimize the intractable loss with two levels of nonconvex components through difference of convex programming and block successive upper-bound minimization. Theoretically, we establish a probabilistic error bound for the ψ-ranker and show that its ranking error has a sharp rate of convergence in the general framework of bipartite ranking, even when the dimension of the model parameters diverges with the sample size. Consequently, this result also indicates that the ψ-ranker performs better than two major approaches in bipartite ranking: pairwise ranking and scoring. Finally, we demonstrate the utility of the ψ-ranker by comparing it with some strong competitors in the literature through simulated examples as well as Expedia booking data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1215-1223 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2019.1691562 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1691562 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1215-1223 Template-Type: ReDIF-Article 1.0 Author-Name: Lax Chan Author-X-Name-First: Lax Author-X-Name-Last: Chan Author-Name: Bernard W. Silverman Author-X-Name-First: Bernard W. Author-X-Name-Last: Silverman Author-Name: Kyle Vincent Author-X-Name-First: Kyle Author-X-Name-Last: Vincent Title: Multiple Systems Estimation for Sparse Capture Data: Inferential Challenges When There Are Nonoverlapping Lists Abstract: Multiple systems estimation strategies have recently been applied to quantify hard-to-reach populations, particularly when estimating the number of victims of human trafficking and modern slavery. In such contexts, it is not uncommon to see sparse or even no overlap between some of the lists on which the estimates are based. These features create difficulties in model fitting and selection, and we develop inference procedures to address these challenges. The approach is based on Poisson log-linear regression modeling.
Issues investigated in detail include taking proper account of data sparsity in the estimation procedure, as well as the existence and identifiability of maximum likelihood estimates. A stepwise method for choosing the most suitable parameters is developed, together with a bootstrap approach to finding confidence intervals for the total population size. We apply the strategy to two empirical datasets of trafficking in US regions, and find that the approach results in stable, reasonable estimates. An accompanying R software implementation has been made publicly available. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1297-1306 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2019.1708748 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1708748 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1297-1306 Template-Type: ReDIF-Article 1.0 Author-Name: Chengchun Shi Author-X-Name-First: Chengchun Author-X-Name-Last: Shi Author-Name: Rui Song Author-X-Name-First: Rui Author-X-Name-Last: Song Author-Name: Wenbin Lu Author-X-Name-First: Wenbin Author-X-Name-Last: Lu Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Title: Statistical Inference for High-Dimensional Models via Recursive Online-Score Estimation Abstract: In this article, we develop a new estimation and valid inference method for single or low-dimensional regression coefficients in high-dimensional generalized linear models. The number of predictors is allowed to grow exponentially fast with respect to the sample size. The proposed estimator is computed by solving a score equation. We recursively conduct model selection to reduce the dimensionality from high to a moderate scale and construct the score equation based on the selected variables. The proposed confidence interval (CI) achieves valid coverage without assuming consistency of the model selection procedure. When the selection consistency is achieved, we show that the length of the proposed CI is asymptotically the same as that of the “oracle” method, which works as well as if the support of the control variables were known. In addition, we prove that the proposed CI is asymptotically narrower than the CIs constructed based on the desparsified Lasso estimator and the decorrelated score statistic. Simulation studies and real data applications are presented to back up our theoretical findings. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1307-1318 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2019.1710154 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1710154 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1307-1318 Template-Type: ReDIF-Article 1.0 Author-Name: Rong Ma Author-X-Name-First: Rong Author-X-Name-Last: Ma Author-Name: T. Tony Cai Author-X-Name-First: T.
Author-X-Name-Last: Tony Cai Author-Name: Hongzhe Li Author-X-Name-First: Hongzhe Author-X-Name-Last: Li Title: Optimal Permutation Recovery in Permuted Monotone Matrix Model Abstract: Motivated by recent research on quantifying bacterial growth dynamics based on genome assemblies, we consider a permuted monotone matrix model Y=ΘΠ+Z, where the rows represent different samples, the columns represent contigs in genome assemblies, and the elements represent log-read counts after preprocessing steps and Guanine-Cytosine (GC) adjustment. In this model, Θ is an unknown mean matrix with monotone entries for each row, Π is a permutation matrix that permutes the columns of Θ, and Z is a noise matrix. This article studies the problem of estimation/recovery of Π given the observed noisy matrix Y. We propose an estimator based on the best linear projection, which is shown to be minimax rate-optimal for both exact recovery, as measured by the 0-1 loss, and partial recovery, as quantified by the normalized Kendall’s tau distance. Simulation studies demonstrate the superior empirical performance of the proposed estimator over alternative methods. We demonstrate the methods using a synthetic metagenomics dataset of 45 closely related bacterial species and a real metagenomic dataset to compare bacterial growth dynamics between responders and nonresponders among IBD patients after 8 weeks of treatment. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1358-1372 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1713794 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1713794 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1358-1372 Template-Type: ReDIF-Article 1.0 Author-Name: Jairo Diaz-Rodriguez Author-X-Name-First: Jairo Author-X-Name-Last: Diaz-Rodriguez Author-Name: Dominique Eckert Author-X-Name-First: Dominique Author-X-Name-Last: Eckert Author-Name: Hatef Monajemi Author-X-Name-First: Hatef Author-X-Name-Last: Monajemi Author-Name: Stéphane Paltani Author-X-Name-First: Stéphane Author-X-Name-Last: Paltani Author-Name: Sylvain Sardy Author-X-Name-First: Sylvain Author-X-Name-Last: Sardy Title: Nonparametric Estimation of Galaxy Cluster Emissivity and Detection of Point Sources in Astrophysics With Two Lasso Penalties Abstract: Astrophysicists are interested in recovering the three-dimensional gas emissivity of a galaxy cluster from a two-dimensional telescope image. Blurring and point sources make this inverse problem harder to solve. The conventional approach requires, as a first step, identifying and masking the point sources. Instead, we model all astrophysical components in a single Poisson generalized linear model. To enforce sparsity on the parameters, maximum likelihood estimation is regularized with two l1 penalties, with weight λ1 for the radial emissivity and λ2 for the point sources. The method has the advantage of not employing cross-validation to select λ1 and λ2. To judge the significance of interesting features, we quantify uncertainty with the bootstrap. We apply our method to data from two X-ray telescopes (XMM-Newton and Chandra) to estimate gas emissivity. The results are more stable and seem less biased than those of the conventional method, in particular in the outskirts of galaxy clusters.
Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1088-1099 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1796676 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1796676 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1088-1099 Template-Type: ReDIF-Article 1.0 Author-Name: Wendy L. Martinez Author-X-Name-First: Wendy L. Author-X-Name-Last: Martinez Title: Back to Our Future: Text Analytics Insights Abstract: Each year, the Journal of the American Statistical Association publishes the presidential address from the Joint Statistical Meetings (JSM). Here, we present the 2020 address verbatim, save for the addition of references and a few minor editorial corrections. Journal: Journal of the American Statistical Association Pages: 1039-1047 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2021.1960760 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1960760 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1039-1047 Template-Type: ReDIF-Article 1.0 Author-Name: Aurore Delaigle Author-X-Name-First: Aurore Author-X-Name-Last: Delaigle Author-Name: Peter Hall Author-X-Name-First: Peter Author-X-Name-Last: Hall Author-Name: Wei Huang Author-X-Name-First: Wei Author-X-Name-Last: Huang Author-Name: Alois Kneip Author-X-Name-First: Alois Author-X-Name-Last: Kneip Title: Estimating the Covariance of Fragmented and Other Related Types of Functional Data Abstract: We consider the problem of estimating the covariance function of functional data that are observed only on a subset of their domain, such as fragments observed on small intervals or related types of functional data. We focus on situations where the data make it possible to compute the empirical covariance function, or smooth versions of it, only on a subset of its domain that contains a diagonal band. We show that estimating the covariance function consistently outside that subset is possible as long as the curves are sufficiently smooth. We establish conditions under which the covariance function is identifiable on its entire domain and propose a tensor product series approach for estimating it consistently. We derive asymptotic properties of our estimator and illustrate its finite sample properties on simulated and real data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1383-1401 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1723597 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1723597 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1383-1401 Template-Type: ReDIF-Article 1.0 Author-Name: Meiling Hao Author-X-Name-First: Meiling Author-X-Name-Last: Hao Author-Name: Kin-yat Liu Author-X-Name-First: Kin-yat Author-X-Name-Last: Liu Author-Name: Wei Xu Author-X-Name-First: Wei Author-X-Name-Last: Xu Author-Name: Xingqiu Zhao Author-X-Name-First: Xingqiu Author-X-Name-Last: Zhao Title: Semiparametric Inference for the Functional Cox Model Abstract: This article studies penalized semiparametric maximum partial likelihood estimation and hypothesis testing for the functional Cox model in analyzing right-censored data with both functional and scalar predictors. Deriving the asymptotic joint distribution of finite-dimensional and infinite-dimensional estimators is a very challenging theoretical problem due to the complexity of semiparametric models. To address this problem, we construct a Sobolev space equipped with a special inner product and derive a new joint Bahadur representation of the estimators of the unknown slope function and coefficients. Using this key tool, we establish the asymptotic joint normality of the proposed estimators and the weak convergence of the estimated slope function, and then construct local and global confidence intervals for an unknown slope function. Furthermore, we study a penalized partial likelihood ratio test, show that the test statistic enjoys the Wilks phenomenon, and verify the optimality of the test. The theoretical results are examined through simulation studies, and a right-censored data example from the Improving Care of Acute Lung Injury Patients study is provided for illustration. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1319-1329 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2019.1710155 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1710155 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1319-1329 Template-Type: ReDIF-Article 1.0 Author-Name: Malka Gorfine Author-X-Name-First: Malka Author-X-Name-Last: Gorfine Author-Name: Nir Keret Author-X-Name-First: Nir Author-X-Name-Last: Keret Author-Name: Asaf Ben Arie Author-X-Name-First: Asaf Author-X-Name-Last: Ben Arie Author-Name: David Zucker Author-X-Name-First: David Author-X-Name-Last: Zucker Author-Name: Li Hsu Author-X-Name-First: Li Author-X-Name-Last: Hsu Title: Marginalized Frailty-Based Illness-Death Model: Application to the UK-Biobank Survival Data Abstract: The UK Biobank is a large-scale health resource comprising genetic, environmental, and medical information on approximately 500,000 volunteer participants in the United Kingdom, recruited at ages 40–69 during the years 2006–2010. The project monitors the health and well-being of its participants. This work demonstrates how these data can be used to yield the building blocks for an interpretable risk-prediction model, in a semiparametric fashion, based on known genetic and environmental risk factors of various chronic diseases, such as colorectal cancer. An illness-death model is adopted, which is inherently a semi-competing risks model, since death can censor the disease, but not vice versa. Using a shared-frailty approach to account for the dependence between time to disease diagnosis and time to death, we provide a new illness-death model that assumes Cox models for the marginal hazard functions.
The recruitment procedure used in this study introduces delayed entry to the data. An additional challenge arising from the recruitment procedure is that information coming from both prevalent and incident cases must be aggregated. Lastly, we do not observe any deaths prior to the minimal recruitment age, 40. In this work, we provide an estimation procedure for our new illness-death model that overcomes all the above challenges. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1155-1167 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1831922 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1831922 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1155-1167 Template-Type: ReDIF-Article 1.0 Author-Name: Rachel C. Nethery Author-X-Name-First: Rachel C. Author-X-Name-Last: Nethery Author-Name: Fabrizia Mealli Author-X-Name-First: Fabrizia Author-X-Name-Last: Mealli Author-Name: Jason D. Sacks Author-X-Name-First: Jason D. Author-X-Name-Last: Sacks Author-Name: Francesca Dominici Author-X-Name-First: Francesca Author-X-Name-Last: Dominici Title: Evaluation of the health impacts of the 1990 Clean Air Act Amendments using causal inference and machine learning Abstract: We develop a causal inference approach to estimate the number of adverse health events that were prevented due to changes in exposure to multiple pollutants attributable to a large-scale air quality intervention/regulation, with a focus on the 1990 Clean Air Act Amendments (CAAA). We introduce a causal estimand called the Total Events Avoided (TEA) by the regulation, defined as the difference in the number of health events expected under the no-regulation pollution exposures and the number observed with-regulation. We propose matching and machine learning methods that leverage population-level pollution and health data to estimate the TEA. Our approach improves upon traditional methods for regulation health impact analyses by formalizing causal identifying assumptions, utilizing population-level data, minimizing parametric assumptions, and collectively analyzing multiple pollutants. To reduce model-dependence, our approach estimates cumulative health impacts in the subset of regions with projected no-regulation features lying within the support of the observed with-regulation data, thereby providing a conservative but data-driven assessment to complement traditional parametric approaches. We analyze the health impacts of the CAAA in the US Medicare population in the year 2000, and our estimates suggest that large numbers of cardiovascular and dementia-related hospitalizations were avoided due to CAAA-attributable changes in pollution exposure. Journal: Journal of the American Statistical Association Pages: 1128-1139 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1803883 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1803883 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1128-1139 Template-Type: ReDIF-Article 1.0 Author-Name: Youjin Lee Author-X-Name-First: Youjin Author-X-Name-Last: Lee Author-Name: Elizabeth L. Ogburn Author-X-Name-First: Elizabeth L. 
Author-X-Name-Last: Ogburn Title: Network Dependence Can Lead to Spurious Associations and Invalid Inference Abstract: Researchers across the health and social sciences generally assume that observations are independent, even while relying on convenience samples that draw subjects from one or a small number of communities, schools, hospitals, etc. A paradigmatic example of this is the Framingham Heart Study (FHS). Many of the limitations of such samples are well-known, but the issue of statistical dependence due to social network ties has not previously been addressed. We show that, along with anticonservative variance estimation, this can result in spurious associations due to network dependence. Using a statistical test that we adapted from one developed for spatial autocorrelation, we test for network dependence in several of the thousands of influential papers that have been published using FHS data. Results suggest that some of the many decades of research on coronary heart disease, other health outcomes, and peer influence using FHS data may suffer from spurious associations, error-prone point estimates, and anticonservative inference due to unacknowledged network dependence. These issues are not unique to the FHS; as researchers in psychology, medicine, and beyond grapple with replication failures, this unacknowledged source of invalid statistical inference should be part of the conversation. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1060-1074 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1782219 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1782219 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1060-1074 Template-Type: ReDIF-Article 1.0 Author-Name: Ross L. Prentice Author-X-Name-First: Ross L. Author-X-Name-Last: Prentice Author-Name: Shanshan Zhao Author-X-Name-First: Shanshan Author-X-Name-Last: Zhao Title: Regression Models and Multivariate Life Tables Abstract: Semiparametric, multiplicative-form regression models are specified for marginal single and double failure hazard rates for the regression analysis of multivariate failure time data. Cox-type estimating functions are specified for single and double failure hazard ratio parameter estimation, and corresponding Aalen–Breslow estimators are specified for baseline hazard rates. Generalization to allow classification of failure times into a smaller set of failure types, with failures of the same type having common baseline hazard functions, is also included. Asymptotic distribution theory arises by generalization of the marginal single failure hazard rate estimation results of Lin et al. The Peano series representation for the bivariate survival function in terms of corresponding marginal single and double failure hazard rates leads to novel estimators for pairwise bivariate survival functions and pairwise dependency functions, at specified covariate history. Related asymptotic distribution theory follows from that for the marginal single and double failure hazard rates, together with the continuity and compact differentiability of the Peano series transformation and the applicability of the bootstrap.
Simulation evaluation of the proposed estimation procedures is presented, and an application to multiple clinical outcomes in the Women’s Health Initiative Dietary Modification Trial is provided. Higher dimensional marginal hazard rate regression modeling is briefly mentioned. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1330-1345 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1713792 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1713792 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1330-1345 Template-Type: ReDIF-Article 1.0 Author-Name: Janice L. Scealy Author-X-Name-First: Janice L. Author-X-Name-Last: Scealy Author-Name: Andrew T. A. Wood Author-X-Name-First: Andrew T. A. Author-X-Name-Last: Wood Title: Analogues on the Sphere of the Affine-Equivariant Spatial Median Abstract: Robust estimation of location for data on the unit sphere S^{p−1} is an important problem in directional statistics even though the influence functions of the sample mean direction and other location estimators are bounded. A significant limitation of previous literature on this topic is that robust estimators and procedures have been developed under the assumption that the underlying population is rotationally symmetric. This assumption often does not hold with real data, and in these cases there is a needless loss of efficiency in the estimator. In this article, we propose two estimators for spherical data, both of which are analogous to the affine-equivariant spatial median in Euclidean space. The influence functions of the new location estimators are obtained under a new semiparametric elliptical symmetry model on the sphere and are shown to be standardized bias robust in the highly concentrated case; the influence function of the companion scatter matrix is also obtained. An iterative algorithm that computes both estimators is described. Asymptotic results, including consistency and asymptotic normality, are also derived for the location estimators that result from applying a fixed number of steps in this algorithm. Numerical studies demonstrate that both location estimators may be expected to perform well in practice in terms of efficiency and robustness. A brief example application from the geophysics literature is also provided. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1457-1471 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1733582 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1733582 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1457-1471 Template-Type: ReDIF-Article 1.0 Author-Name: Federico Ferrari Author-X-Name-First: Federico Author-X-Name-Last: Ferrari Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: Bayesian Factor Analysis for Inference on Interactions Abstract: This article is motivated by the problem of inference on interactions among chemical exposures impacting human health outcomes. Chemicals often co-occur in the environment or in synthetic mixtures, and as a result exposure levels can be highly correlated.
We propose a latent factor joint model, which includes shared factors in both the predictor and response components while assuming conditional independence. By including a quadratic regression in the latent variables in the response component, we induce flexible dimension reduction in characterizing main effects and interactions. We propose a Bayesian approach to inference under this factor analysis for interactions (FIN) framework. Through appropriate modifications of the factor modeling structure, FIN can accommodate higher order interactions. We evaluate the performance using a simulation study and data from the National Health and Nutrition Examination Survey. Code is available on GitHub. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1521-1532 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1745813 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1745813 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1521-1532 Template-Type: ReDIF-Article 1.0 Author-Name: Md Kamrul Hasan Khan Author-X-Name-First: Md Kamrul Hasan Author-X-Name-Last: Khan Author-Name: Avishek Chakraborty Author-X-Name-First: Avishek Author-X-Name-Last: Chakraborty Author-Name: Giovanni Petris Author-X-Name-First: Giovanni Author-X-Name-Last: Petris Author-Name: Barry T. Wilson Author-X-Name-First: Barry T. Author-X-Name-Last: Wilson Title: Constrained Functional Regression of National Forest Inventory Data Over Time Using Remote Sensing Observations Abstract: The USDA Forest Service uses satellite imagery, along with a sample of national forest inventory field plots, to monitor and predict changes in forest conditions over time throughout the United States. We specifically focus on a 230,400 ha region in north-central Wisconsin between 2003 and 2012. The auxiliary data from the satellite imagery of this region are relatively dense in space and time, and can be used to learn how forest conditions changed over that decade. However, these records have a significant proportion of missing values, due to weather conditions and system failures, which we first fill in using a spatiotemporal model. Subsequently, we use the complete imagery as functional predictors in a two-component mixture model to capture the spatial variation in yearly average live tree basal area, an attribute of interest measured on field plots. We further modify the regression equation to accommodate a biophysical constraint on how plot-level live tree basal area can change from one year to the next. Findings from our analysis, represented with a series of maps, match known spatial patterns across the landscape. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1168-1180 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1860769 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1860769 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1168-1180 Template-Type: ReDIF-Article 1.0 Author-Name: Masayo Y. Hirose Author-X-Name-First: Masayo Y.
Author-X-Name-Last: Hirose Author-Name: Partha Lahiri Author-X-Name-First: Partha Author-X-Name-Last: Lahiri Title: Multi-Goal Prior Selection: A Way to Reconcile Bayesian and Classical Approaches for Random Effects Models Abstract: The two-level normal hierarchical model has played an important role in statistical theory and applications. In this article, we first introduce a general adjusted maximum likelihood method for estimating the unknown variance component of the model and the associated empirical best linear unbiased predictor of the random effects. We then discuss a new idea for selecting a prior for the hyperparameters. The prior, called a multi-goal prior, produces Bayesian solutions for hyperparameters and random effects that match (in the higher-order asymptotic sense) the corresponding classical solutions in the linear mixed model with respect to several properties. Moreover, we establish for the first time an analytical equivalence of the posterior variances under the proposed multi-goal prior and the corresponding parametric bootstrap second-order mean squared error estimates in the context of a random effects model. Journal: Journal of the American Statistical Association Pages: 1487-1497 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1737532 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1737532 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1487-1497 Template-Type: ReDIF-Article 1.0 Author-Name: Pierre E. Jacob Author-X-Name-First: Pierre E. Author-X-Name-Last: Jacob Author-Name: Ruobin Gong Author-X-Name-First: Ruobin Author-X-Name-Last: Gong Author-Name: Paul T. Edlefsen Author-X-Name-First: Paul T. Author-X-Name-Last: Edlefsen Author-Name: Arthur P. Dempster Author-X-Name-First: Arthur P. Author-X-Name-Last: Dempster Title: Rejoinder—A Gibbs Sampler for a Class of Random Convex Polytopes Abstract: We are very grateful to all commenters for their stimulating remarks, questions, as well as useful pointers to the literature which span a wide range of statistical methods over decades of research. We have neither the space nor the knowledge to answer many of the questions raised, and we only aim to offer some clarifications. We hope that readers will be as enthusiastic as ourselves about research on the topics discussed by the commenters. In the following, we refer to Diaconis and Wang as DW, Hoffman, Hannig and Zhang as HHZ, Lawrence and Vander Wiel as LV, Ruggeri as R, Shafer as S, and Williams as W. Journal: Journal of the American Statistical Association Pages: 1211-1214 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2021.1945458 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1945458 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1211-1214 Template-Type: ReDIF-Article 1.0 Author-Name: Yanxi Hou Author-X-Name-First: Yanxi Author-X-Name-Last: Hou Author-Name: Xing Wang Author-X-Name-First: Xing Author-X-Name-Last: Wang Title: Extremes and Inference for Tail Gini Functionals With Applications in Tail Risk Measurement Abstract: Tail risk analysis focuses on the problem of risk measurement on the tail regions of financial variables. As one crucial task in tail risk analysis for risk management, the measurement of tail risk variability has received less attention in the literature.
Neither the theoretical results nor the inference methods are fully developed, which makes implementation difficult. Practitioners therefore lack measurement methods for understanding and evaluating tail risks, even when they have large amounts of valuable data in hand. In this article, we consider the measurement of tail variability under the tail scenarios of a systemic variable by extending Gini’s methodology. Because we are interested in the limit of the proposed measures as the risk level approaches the extreme, we show, using extreme value techniques, how the tail dependence structure and marginal risk severity influence this limit. We construct a nonparametric estimator, and its asymptotic behavior is explored. Furthermore, to provide practitioners with additional tools, we construct three coefficients/measures of tail risk from different perspectives and illustrate them in a real data analysis. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1428-1443 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1730855 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1730855 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1428-1443 Template-Type: ReDIF-Article 1.0 Author-Name: Kara E. Rudolph Author-X-Name-First: Kara E. Author-X-Name-Last: Rudolph Author-Name: Oleg Sofrygin Author-X-Name-First: Oleg Author-X-Name-Last: Sofrygin Author-Name: Mark J. van der Laan Author-X-Name-First: Mark J. Author-X-Name-Last: van der Laan Title: Complier Stochastic Direct Effects: Identification and Robust Estimation Abstract: Mediation analysis is critical to understanding the mechanisms underlying exposure-outcome relationships. In this article, we identify the instrumental variable–direct effect of the exposure on the outcome not operating through the mediator, using randomization of the instrument. We call this estimand the complier stochastic direct effect (CSDE). To our knowledge, such an estimand has not previously been considered or estimated. We propose and evaluate several estimators for the CSDE: a ratio of inverse-probability of treatment-weighted estimators (IPTW), a ratio of estimating equation estimators (EE), a ratio of targeted minimum loss-based estimators (TMLE), and a TMLE that targets the CSDE directly. These estimators are applicable to a variety of study designs, including randomized encouragement trials (like the Moving to Opportunity housing voucher experiment we consider as an illustrative example), treatment discontinuities, and Mendelian randomization. We found the IPTW estimator to be the most sensitive to finite sample bias, resulting in bias of over 40% even when all models were correctly specified in a sample size of N = 100. In contrast, the EE estimator and TMLE that targets the CSDE directly were far less sensitive. The EE and TML estimators also have advantages in terms of efficiency and reduced reliance on correct parametric model specification. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 1254-1264 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2019.1704292 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1704292 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1254-1264 Template-Type: ReDIF-Article 1.0 Author-Name: Fabrizio Ruggeri Author-X-Name-First: Fabrizio Author-X-Name-Last: Ruggeri Title: Comment on “A Gibbs Sampler for a Class of Random Convex Polytopes” by P.E. Jacob, R. Gong, P.T. Edlefsen and A.P. Dempster Journal: Journal of the American Statistical Association Pages: 1204-1205 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2021.1946404 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1946404 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1204-1205 Template-Type: ReDIF-Article 1.0 Author-Name: Kentaro Hoffman Author-X-Name-First: Kentaro Author-X-Name-Last: Hoffman Author-Name: Jan Hannig Author-X-Name-First: Jan Author-X-Name-Last: Hannig Author-Name: Kai Zhang Author-X-Name-First: Kai Author-X-Name-Last: Zhang Title: Comments on “A Gibbs Sampler for a Class of Random Convex Polytopes” Journal: Journal of the American Statistical Association Pages: 1206-1210 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2021.1950002 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1950002 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1206-1210 Template-Type: ReDIF-Article 1.0 Author-Name: Xialiang Dou Author-X-Name-First: Xialiang Author-X-Name-Last: Dou Author-Name: Tengyuan Liang Author-X-Name-First: Tengyuan Author-X-Name-Last: Liang Title: Training Neural Networks as Learning Data-adaptive Kernels: Provable Representation and Approximation Benefits Abstract: Consider the problem: given the data pair (x,y) drawn from a population with f*(x)=E[y|x=x], specify a neural network model and run gradient flow on the weights over time until reaching any stationarity. How does ft, the function computed by the neural network at time t, relate to f*, in terms of approximation and representation? What are the provable benefits of the adaptive representation by neural networks compared to the pre-specified fixed basis representation in the classical nonparametric literature? We answer the above questions via a dynamic reproducing kernel Hilbert space (RKHS) approach indexed by the training process of neural networks. Firstly, we show that when reaching any local stationarity, gradient flow learns an adaptive RKHS representation and performs the global least-squares projection onto the adaptive RKHS simultaneously. Secondly, we prove that as the RKHS is data-adaptive and task-specific, the residual for f* lies in a subspace that is potentially much smaller than the orthogonal complement of the RKHS. The result formalizes the representation and approximation benefits of neural networks. Lastly, we show that the neural network function computed by gradient flow converges to the kernel ridgeless regression with an adaptive kernel, in the limit of vanishing regularization. The adaptive kernel viewpoint provides new angles of studying the approximation, representation, generalization, and optimization advantages of neural networks.
Journal: Journal of the American Statistical Association Pages: 1507-1520 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2020.1745812 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1745812 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1507-1520 Template-Type: ReDIF-Article 1.0 Author-Name: Xiwei Tang Author-X-Name-First: Xiwei Author-X-Name-Last: Tang Author-Name: Fei Xue Author-X-Name-First: Fei Author-X-Name-Last: Xue Author-Name: Annie Qu Author-X-Name-First: Annie Author-X-Name-Last: Qu Title: Individualized Multidirectional Variable Selection Abstract: In this article, we propose a heterogeneous modeling framework which simultaneously achieves individual-wise feature selection and subgrouping of heterogeneous covariate effects. In contrast to conventional model selection approaches, the new approach constructs a separation penalty with multidirectional shrinkages, which facilitates individualized modeling to distinguish strong signals from noisy ones and selects different relevant variables for different individuals. Meanwhile, the proposed model identifies subgroups in which individuals share similar covariate effects, thereby improving individualized estimation efficiency and feature selection accuracy. Moreover, the proposed model also incorporates within-individual correlation for longitudinal data to gain extra efficiency. We provide a general theoretical foundation under a double-divergence modeling framework where the number of individuals and the number of individual-wise measurements can both diverge, which enables inference on both an individual level and a population level. In particular, we establish a strong oracle property for the individualized estimator to ensure its optimal large sample property under various conditions. An efficient ADMM algorithm is developed for computational scalability. Simulation studies and applications to post-trauma mental disorder analysis with genetic variation and an HIV longitudinal treatment study illustrate the new approach and compare it with existing methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1280-1296 Issue: 535 Volume: 116 Year: 2021 Month: 7 X-DOI: 10.1080/01621459.2019.1705308 File-URL: http://hdl.handle.net/10.1080/01621459.2019.1705308 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1280-1296 Template-Type: ReDIF-Article 1.0 Author-Name: Fangzheng Xie Author-X-Name-First: Fangzheng Author-X-Name-Last: Xie Author-Name: Yanxun Xu Author-X-Name-First: Yanxun Author-X-Name-Last: Xu Title: Bayesian Projected Calibration of Computer Models Abstract: We develop a Bayesian approach called the Bayesian projected calibration to address the problem of calibrating an imperfect computer model using observational data from an unknown complex physical system. The calibration parameter and the physical system are parameterized in an identifiable fashion via the L2-projection. A Gaussian process prior distribution is imposed on the physical system, which naturally induces a prior distribution on the calibration parameter through the L2-projection constraint. The calibration parameter is estimated through its posterior distribution, serving as a natural and nonasymptotic approach to uncertainty quantification.
We provide rigorous large sample justifications of the proposed approach by establishing the asymptotic normality of the posterior of the calibration parameter with the efficient covariance matrix. In addition to the theoretical analysis, two convenient computational algorithms based on stochastic approximation are designed with strong theoretical support. Through extensive simulation studies and the analyses of two real-world datasets, we show that the proposed Bayesian projected calibration can accurately estimate the calibration parameters, calibrate the computer models well, and compare favorably to alternative approaches. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1965-1982 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2020.1753519 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1753519 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1965-1982 Template-Type: ReDIF-Article 1.0 Author-Name: DongHyuk Lee Author-X-Name-First: DongHyuk Author-X-Name-Last: Lee Author-Name: Bin Zhu Author-X-Name-First: Bin Author-X-Name-Last: Zhu Title: A Semiparametric Kernel Independence Test With Application to Mutational Signatures Abstract: Cancers arise owing to somatic mutations, and the characteristic combinations of somatic mutations form mutational signatures. Although many mutational signatures have been identified, the mutational processes underlying a number of them remain unknown, which hinders the identification of interventions that may reduce somatic mutation burdens and prevent the development of cancer. We demonstrate that the unknown cause of a mutational signature can be inferred by the associated signatures with known etiology. However, existing association tests are not statistically powerful due to excess zeros in mutational signatures data. To address this limitation, we propose a semiparametric kernel independence test (SKIT). The SKIT statistic is defined as the integrated squared distance between mixed probability distributions and is decomposed into four disjoint components to pinpoint the source of dependency. We derive the asymptotic null distribution and prove the asymptotic convergence of power. Due to slow convergence to the asymptotic null distribution, a bootstrap method is employed to compute p-values. Simulation studies demonstrate that when zeros are prevalent, SKIT is more resilient to power loss than existing tests and robust to random errors. We applied SKIT to The Cancer Genome Atlas mutational signatures data for over 9000 tumors across 32 cancer types, and identified a novel association between signature 17 curated in the Catalogue of Somatic Mutations in Cancer and apolipoprotein B mRNA editing enzyme (APOBEC) signatures in gastrointestinal cancers. This indicates that APOBEC activity is likely associated with the unknown cause of signature 17. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1648-1661 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2020.1871357 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1871357 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1648-1661 Template-Type: ReDIF-Article 1.0 Author-Name: Qihui Su Author-X-Name-First: Qihui Author-X-Name-Last: Su Author-Name: Zhongling Qin Author-X-Name-First: Zhongling Author-X-Name-Last: Qin Author-Name: Liang Peng Author-X-Name-First: Liang Author-X-Name-Last: Peng Author-Name: Gengsheng Qin Author-X-Name-First: Gengsheng Author-X-Name-Last: Qin Title: Efficiently Backtesting Conditional Value-at-Risk and Conditional Expected Shortfall Abstract: Given the importance of backtesting risk models and forecasts for financial institutions and regulators, we develop an efficient empirical likelihood backtest for either conditional value-at-risk or conditional expected shortfall when the given risk variable is modeled by an ARMA-GARCH process. Using a two-step procedure, the proposed backtests require fewer finite moments than existing backtests, allowing for robustness to heavier tails. Furthermore, we add a constraint on the goodness of fit of the error distribution to provide more accurate risk forecasts and improved test power. A simulation study confirms the good finite sample performance of the new backtests, and empirical analyses demonstrate the usefulness of these efficient backtests for monitoring financial crises. Journal: Journal of the American Statistical Association Pages: 2041-2052 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2020.1763804 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1763804 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:2041-2052 Template-Type: ReDIF-Article 1.0 Author-Name: Xueyu Mao Author-X-Name-First: Xueyu Author-X-Name-Last: Mao Author-Name: Purnamrita Sarkar Author-X-Name-First: Purnamrita Author-X-Name-Last: Sarkar Author-Name: Deepayan Chakrabarti Author-X-Name-First: Deepayan Author-X-Name-Last: Chakrabarti Title: Estimating Mixed Memberships With Sharp Eigenvector Deviations Abstract: We consider the problem of estimating community memberships of nodes in a network, where every node is associated with a vector determining its degree of membership in each community. Existing provably consistent algorithms often require strong assumptions about the population, are computationally expensive, and only provide an overall error bound for the whole community membership matrix. This article provides uniform rates of convergence for the inferred community membership vector of each node in a network generated from the mixed membership stochastic blockmodel (MMSB); to our knowledge, this is the first work to establish per-node rates for overlapping community detection in networks. We achieve this by establishing sharp row-wise eigenvector deviation bounds for MMSB. Based on the simplex structure inherent in the eigen-decomposition of the population matrix, we build on established corner-finding algorithms from the optimization community to infer the community membership vectors. Our results hold over a broad parameter regime where the average degree only grows poly-logarithmically with the number of nodes. Using experiments with simulated and real datasets, we show that our method achieves better error with lower variability than competing methods, and processes real-world networks of up to 100,000 nodes within tens of seconds. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 1928-1940 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2020.1751645 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1751645 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1928-1940 Template-Type: ReDIF-Article 1.0 Author-Name: Fan Zhou Author-X-Name-First: Fan Author-X-Name-Last: Zhou Author-Name: Shikai Luo Author-X-Name-First: Shikai Author-X-Name-Last: Luo Author-Name: Xiaohu Qie Author-X-Name-First: Xiaohu Author-X-Name-Last: Qie Author-Name: Jieping Ye Author-X-Name-First: Jieping Author-X-Name-Last: Ye Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Title: Graph-Based Equilibrium Metrics for Dynamic Supply–Demand Systems With Applications to Ride-sourcing Platforms Abstract: How to dynamically measure the local-to-global spatio-temporal coherence between demand and supply networks is a fundamental task for ride-sourcing platforms, such as DiDi. Such coherence measurement is critically important for the quantification of market efficiency and the comparison of different platform policies, such as dispatching. The aim of this paper is to introduce a graph-based equilibrium metric (GEM) to quantify the distance between demand and supply networks based on a weighted graph structure. We formulate GEM as the optimal objective value of an unbalanced optimal transport problem, which can be recast as an equivalent linear program and solved efficiently. We examine how the GEM can help solve three operational tasks of ride-sourcing platforms. First, GEM achieves up to a 70.6% reduction in root-mean-square error over the second-best distance measure in predicting the order answer rate. Second, using GEM to design the order dispatching policy increases drivers’ revenue by more than 1%, a substantial improvement at this scale. Third, GEM can serve as an endpoint for comparing different platform policies in A/B tests. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1688-1699 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1898409 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1898409 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1688-1699 Template-Type: ReDIF-Article 1.0 Author-Name: Natalie Dean Author-X-Name-First: Natalie Author-X-Name-Last: Dean Author-Name: Yang Yang Author-X-Name-First: Yang Author-X-Name-Last: Yang Title: Discussion of “Regression Models for Understanding COVID-19 Epidemic Dynamics With Incomplete Data” Journal: Journal of the American Statistical Association Pages: 1587-1590 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1982722 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1982722 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1587-1590 Template-Type: ReDIF-Article 1.0 Author-Name: Sangwook Kang Author-X-Name-First: Sangwook Author-X-Name-Last: Kang Title: Advanced Survival Models Journal: Journal of the American Statistical Association Pages: 2098-2099 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1997014 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1997014 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:2098-2099 Template-Type: ReDIF-Article 1.0 Author-Name: Yutong Li Author-X-Name-First: Yutong Author-X-Name-Last: Li Author-Name: Ruoqing Zhu Author-X-Name-First: Ruoqing Author-X-Name-Last: Zhu Author-Name: Annie Qu Author-X-Name-First: Annie Author-X-Name-Last: Qu Author-Name: Han Ye Author-X-Name-First: Han Author-X-Name-Last: Ye Author-Name: Zhankun Sun Author-X-Name-First: Zhankun Author-X-Name-Last: Sun Title: Topic Modeling on Triage Notes With Semiorthogonal Nonnegative Matrix Factorization Abstract: Emergency department (ED) crowding is a universal health issue that affects the efficiency of hospital management and patient care quality. ED crowding frequently occurs when a request for a ward-bed for a patient is delayed until a doctor makes an admission decision. In this case study, we build a classifier to predict the disposition of patients using manually typed nurse notes collected during triage as provided by the Alberta Medical Center. These predictions can potentially be incorporated into early bed coordination and fast-track streaming strategies to alleviate overcrowding and waiting times in the ED. However, these triage notes involve high-dimensional, noisy, and sparse text data, which make model-fitting and interpretation difficult. To address this issue, we propose a novel semiorthogonal nonnegative matrix factorization for both continuous and binary predictors to reduce the dimensionality and derive word topics. The triage notes can then be interpreted as a non-subtractive linear combination of orthogonal basis topic vectors. Our real data analysis shows that the triage notes contain strong predictive information toward classifying the disposition of patients for certain medical complaints, such as altered consciousness or stroke. Additionally, we show that the document-topic vectors generated by our method can be used as features to further improve classification accuracy by up to 1% across different medical complaints, for example, 74.3%–75.3% accuracy for patients with stroke symptoms. This improvement could be clinically impactful for certain patients, especially when patient volumes are large. Furthermore, the generated word-topic vectors provide a bi-clustering interpretation under each topic due to the orthogonal formulation, which can be beneficial for hospitals in better understanding the symptoms and reasons behind patients’ visits. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1609-1624 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2020.1862667 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1862667 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1609-1624 Template-Type: ReDIF-Article 1.0 Author-Name: Maxwell Kellogg Author-X-Name-First: Maxwell Author-X-Name-Last: Kellogg Author-Name: Magne Mogstad Author-X-Name-First: Magne Author-X-Name-Last: Mogstad Author-Name: Guillaume A. Pouliot Author-X-Name-First: Guillaume A. Author-X-Name-Last: Pouliot Author-Name: Alexander Torgovitsky Author-X-Name-First: Alexander Author-X-Name-Last: Torgovitsky Title: Combining Matching and Synthetic Control to Tradeoff Biases From Extrapolation and Interpolation Abstract: The synthetic control (SC) method is widely used in comparative case studies to adjust for differences in pretreatment characteristics. SC limits extrapolation bias at the potential expense of interpolation bias, whereas traditional matching estimators have the opposite properties. This complementarity motivates us to propose a matching and synthetic control (or MASC) estimator as a model averaging estimator that combines the standard SC and matching estimators. We show how to use a rolling-origin cross-validation procedure to train the MASC to resolve tradeoffs between interpolation and extrapolation bias. We use a series of empirically based placebo and Monte Carlo simulations to shed light on when the SC, matching, MASC, and penalized SC estimators do (and do not) perform well. Then, we apply these estimators to examine the economic costs of conflicts in the context of Spain. Journal: Journal of the American Statistical Association Pages: 1804-1816 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1979562 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1979562 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1804-1816 Template-Type: ReDIF-Article 1.0 Author-Name: Junhyung Park Author-X-Name-First: Junhyung Author-X-Name-Last: Park Author-Name: Frederic Paik Schoenberg Author-X-Name-First: Frederic Paik Author-X-Name-Last: Schoenberg Author-Name: Andrea L. Bertozzi Author-X-Name-First: Andrea L. Author-X-Name-Last: Bertozzi Author-Name: P. Jeffrey Brantingham Author-X-Name-First: P. Jeffrey Author-X-Name-Last: Brantingham Title: Investigating Clustering and Violence Interruption in Gang-Related Violent Crime Data Using Spatial–Temporal Point Processes With Covariates Abstract: Reported gang-related violent crimes in Los Angeles, California, from 1/1/14 to 12/31/17 are modeled using spatial–temporal marked Hawkes point processes with covariates. We propose an algorithm to estimate the spatial–temporally varying background rate nonparametrically as a function of demographic covariates. Kernel smoothing and generalized additive models are used in an attempt to model the background rate as closely as possible in an effort to differentiate inhomogeneity in the background rate from causal clustering or triggering of events. The models are fit to data from 2014 to 2016 and evaluated using data from 2017, based on log-likelihood and superthinned residuals. The impact of nonrandomized violence interruption performed by The City of Los Angeles Mayor’s Office of Gang Reduction and Youth Development (GRYD) Incident Response (IR) Program is assessed by comparing the triggering associated with GRYD IR Program events to the triggering associated with sub-sampled non-GRYD events selected to have a similar spatial–temporal distribution.
The results suggest that GRYD IR Program violence interruption yields a reduction of approximately 18.3% in the retaliation rate in locations more than 130 m from the original reported crimes, and a reduction of 14.2% in retaliations within 130 m. Journal: Journal of the American Statistical Association Pages: 1674-1687 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1898408 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1898408 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1674-1687 Template-Type: ReDIF-Article 1.0 Author-Name: Jushan Bai Author-X-Name-First: Jushan Author-X-Name-Last: Bai Author-Name: Serena Ng Author-X-Name-First: Serena Author-X-Name-Last: Ng Title: Matrix Completion, Counterfactuals, and Factor Analysis of Missing Data Abstract: This article proposes an imputation procedure that uses the factors estimated from a tall block along with the re-rotated loadings estimated from a wide block to impute missing values in a panel of data. Assuming that a strong factor structure holds for the full panel of data and its sub-blocks, it is shown that the common component can be consistently estimated at four different rates of convergence without requiring regularization or iteration. An asymptotic analysis of the estimation error is obtained. An application of our analysis is estimation of counterfactuals when potential outcomes have a factor structure. We study the estimation of average and individual treatment effects on the treated and establish a normal distribution theory that can be useful for hypothesis testing. Journal: Journal of the American Statistical Association Pages: 1746-1763 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1967163 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1967163 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1746-1763 Template-Type: ReDIF-Article 1.0 Author-Name: Matias D. Cattaneo Author-X-Name-First: Matias D. Author-X-Name-Last: Cattaneo Author-Name: Luke Keele Author-X-Name-First: Luke Author-X-Name-Last: Keele Author-Name: Rocío Titiunik Author-X-Name-First: Rocío Author-X-Name-Last: Titiunik Author-Name: Gonzalo Vazquez-Bare Author-X-Name-First: Gonzalo Author-X-Name-Last: Vazquez-Bare Title: Extrapolating Treatment Effects in Multi-Cutoff Regression Discontinuity Designs Abstract: In nonexperimental settings, the regression discontinuity (RD) design is one of the most credible identification strategies for program evaluation and causal inference. However, RD treatment effect estimands are necessarily local, making statistical methods for the extrapolation of these effects a key area for development. We introduce a new method for extrapolation of RD effects that relies on the presence of multiple cutoffs, and is therefore design-based. Our approach employs an easy-to-interpret identifying assumption that mimics the idea of “common trends” in difference-in-differences designs. We illustrate our methods with data on a subsidized loan program on post-education attendance in Colombia, and offer new evidence on program effects for students with test scores away from the cutoff that determined program eligibility. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 1941-1952 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2020.1751646 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1751646 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1941-1952 Template-Type: ReDIF-Article 1.0 Author-Name: Azeem M. Shaikh Author-X-Name-First: Azeem M. Author-X-Name-Last: Shaikh Author-Name: Panos Toulis Author-X-Name-First: Panos Author-X-Name-Last: Toulis Title: Randomization Tests in Observational Studies With Staggered Adoption of Treatment Abstract: This article considers the problem of inference in observational studies with time-varying adoption of treatment. In addition to an unconfoundedness assumption that the potential outcomes are independent of the times at which units adopt treatment conditional on the units’ observed characteristics, our analysis assumes that the time at which each unit adopts treatment follows a Cox proportional hazards model. This assumption permits the time at which each unit adopts treatment to depend on the observed characteristics of the unit, but imposes the restriction that the probability of multiple units adopting treatment at the same time is zero. In this context, we study randomization tests of a null hypothesis that specifies that there is no treatment effect for all units and all time periods in a distributional sense. We first show that an infeasible test that treats the parameters of the Cox model as known has rejection probability under the null hypothesis no greater than the nominal level in finite samples. Since these parameters are unknown in practice, this result motivates a feasible test that replaces these parameters with consistent estimators. While the resulting test does not need to have the same finite-sample validity as the infeasible test, we show that it has limiting rejection probability under the null hypothesis no greater than the nominal level. In a simulation study, we examine the practical relevance of our theoretical results, including robustness to misspecification of the model for the time at which each unit adopts treatment. Finally, we provide an empirical application of our methodology using the synthetic control-based test statistic and tobacco legislation data found in Abadie, Diamond and Hainmueller. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1835-1848 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1974458 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1974458 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1835-1848 Template-Type: ReDIF-Article 1.0 Author-Name: Susan Athey Author-X-Name-First: Susan Author-X-Name-Last: Athey Author-Name: Mohsen Bayati Author-X-Name-First: Mohsen Author-X-Name-Last: Bayati Author-Name: Nikolay Doudchenko Author-X-Name-First: Nikolay Author-X-Name-Last: Doudchenko Author-Name: Guido Imbens Author-X-Name-First: Guido Author-X-Name-Last: Imbens Author-Name: Khashayar Khosravi Author-X-Name-First: Khashayar Author-X-Name-Last: Khosravi Title: Matrix Completion Methods for Causal Panel Data Models Abstract: In this article, we study methods for estimating causal effects in settings with panel data, where some units are exposed to a treatment during some periods and the goal is estimating counterfactual (untreated) outcomes for the treated unit/period combinations. We propose a class of matrix completion estimators that uses the observed elements of the matrix of control outcomes corresponding to untreated unit/periods to impute the “missing” elements of the control outcome matrix, corresponding to treated units/periods. This leads to a matrix that well-approximates the original (incomplete) matrix, but has lower complexity according to the nuclear norm for matrices. We generalize results from the matrix completion literature by allowing the patterns of missing data to have a time series dependency structure that is common in social science applications. We present novel insights concerning the connections between the matrix completion literature, the literature on interactive fixed effects models and the literatures on program evaluation under unconfoundedness and synthetic control methods. We show that all these estimators can be viewed as focusing on the same objective function. They differ solely in the way they deal with identification, in some cases solely through regularization (our proposed nuclear norm matrix completion estimator) and in other cases primarily through imposing hard restrictions (the unconfoundedness and synthetic control approaches). The proposed method outperforms unconfoundedness-based or synthetic control estimators in simulations based on real data. Journal: Journal of the American Statistical Association Pages: 1716-1730 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1891924 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1891924 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1716-1730 Template-Type: ReDIF-Article 1.0 Author-Name: Xu Shi Author-X-Name-First: Xu Author-X-Name-Last: Shi Author-Name: Xiaoou Li Author-X-Name-First: Xiaoou Author-X-Name-Last: Li Author-Name: Tianxi Cai Author-X-Name-First: Tianxi Author-X-Name-Last: Cai Title: Spherical Regression Under Mismatch Corruption With Application to Automated Knowledge Translation Abstract: Motivated by a series of applications in data integration, language translation, bioinformatics, and computer vision, we consider spherical regression with two sets of unit-length vectors when the data are corrupted by a small fraction of mismatch in the response-predictor pairs. We propose a three-step algorithm in which we initialize the parameters by solving an orthogonal Procrustes problem to estimate a translation matrix W ignoring the mismatch. We then estimate a mapping matrix aiming to correct the mismatch using hard-thresholding to induce sparsity, while incorporating potential group information. 
We eventually obtain a refined estimate for W by removing the estimated mismatched pairs. We derive the error bound for the initial estimate of W in both fixed- and high-dimensional settings. We demonstrate that the refined estimate of W achieves an error rate that is as good as if no mismatch were present. We show that our mapping recovery method not only correctly distinguishes one-to-one and one-to-many correspondences, but also consistently identifies the matched pairs and estimates the weight vector for combined correspondence. We examine the finite sample performance of the proposed method via extensive simulation studies, and with application to the unsupervised translation of medical codes using electronic health records data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1953-1964 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2020.1752219 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1752219 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1953-1964 Template-Type: ReDIF-Article 1.0 Author-Name: Corbin Quick Author-X-Name-First: Corbin Author-X-Name-Last: Quick Author-Name: Rounak Dey Author-X-Name-First: Rounak Author-X-Name-Last: Dey Author-Name: Xihong Lin Author-X-Name-First: Xihong Author-X-Name-Last: Lin Title: Regression Models for Understanding COVID-19 Epidemic Dynamics With Incomplete Data Abstract: Modeling infectious disease dynamics has been critical throughout the COVID-19 pandemic. Of particular interest are the incidence, prevalence, and effective reproductive number (Rt). Estimating these quantities is challenging due to under-ascertainment, unreliable reporting, and time lags between infection, onset, and testing. We propose a Multilevel Epidemic Regression Model to Account for Incomplete Data (MERMAID) to jointly estimate Rt, ascertainment rates, incidence, and prevalence over time in one or multiple regions. Specifically, MERMAID allows for a flexible regression model of Rt that can incorporate geographic and time-varying covariates. To account for under-ascertainment, we (a) model the ascertainment probability over time as a function of testing metrics and (b) jointly model data on confirmed infections and population-based serological surveys. To account for delays between infection, onset, and reporting, we model stochastic lag times as missing data, and develop an EM algorithm to estimate the model parameters. We evaluate the performance of MERMAID in simulation studies, and assess its robustness by conducting sensitivity analyses in a range of scenarios of model misspecifications. We apply the proposed method to analyze COVID-19 daily confirmed infection counts, PCR testing data, and serological survey data across the United States. Based on our model, we estimate an overall COVID-19 prevalence of 12.5% (ranging from 2.4% in Maine to 20.2% in New York) and an overall ascertainment rate of 45.5% (ranging from 22.5% in New York to 81.3% in Rhode Island) in the United States from March to December 2020. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association Pages: 1561-1577 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.2001339 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2001339 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1561-1577 Template-Type: ReDIF-Article 1.0 Author-Name: Alan Julian Izenman Author-X-Name-First: Alan Julian Author-X-Name-Last: Izenman Title: Sampling Algorithms for Discrete Markov Random Fields and Related Graphical Models Abstract: Discrete Markov random fields are undirected graphical models in which the nodes of a graph are discrete random variables with values usually represented by colors. Typically, graphs are taken to be square lattices, although more general graphs are also of interest. Such discrete MRFs have been studied in many disciplines. We describe the two most popular types of discrete MRFs, namely the two-state Ising model and the q-state Potts model, and variations such as the cellular automaton, the cellular Potts model, and the random cluster model, the latter of which is a continuous generalization of both the Ising and Potts models. Research interest is usually focused on providing algorithms for simulating from these models because the partition function is so computationally intractable that statistical inference for the parameters of the appropriate probability distribution becomes very complicated. Substantial improvements to the Metropolis algorithm have appeared in the form of cluster algorithms, such as the Swendsen–Wang and Wolff algorithms. We study the simulation processes of these algorithms, which update the color of a cluster of nodes at each iteration. Journal: Journal of the American Statistical Association Pages: 2065-2086 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1898410 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1898410 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:2065-2086 Template-Type: ReDIF-Article 1.0 Author-Name: Matias D. Cattaneo Author-X-Name-First: Matias D. Author-X-Name-Last: Cattaneo Author-Name: Yingjie Feng Author-X-Name-First: Yingjie Author-X-Name-Last: Feng Author-Name: Rocio Titiunik Author-X-Name-First: Rocio Author-X-Name-Last: Titiunik Title: Prediction Intervals for Synthetic Control Methods Abstract: Uncertainty quantification is a fundamental problem in the analysis and interpretation of synthetic control (SC) methods. We develop conditional prediction intervals in the SC framework, and provide conditions under which these intervals offer finite-sample probability guarantees. Our method allows for covariate adjustment and nonstationary data. The construction begins by noting that the statistical uncertainty of the SC prediction is governed by two distinct sources of randomness: one coming from the construction of the (likely misspecified) SC weights in the pretreatment period, and the other coming from the unobservable stochastic error in the post-treatment period when the treatment effect is analyzed. Accordingly, our proposed prediction intervals are constructed taking into account both sources of randomness. For implementation, we propose a simulation-based approach along with finite-sample-based probability bound arguments, naturally leading to principled sensitivity analysis methods. 
We illustrate the numerical performance of our methods using empirical applications and a small simulation study. Python, R, and Stata software packages implementing our methodology are available. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1865-1880 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1979561 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1979561 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1865-1880 Template-Type: ReDIF-Article 1.0 Author-Name: Andrew Gelman Author-X-Name-First: Andrew Author-X-Name-Last: Gelman Author-Name: Aki Vehtari Author-X-Name-First: Aki Author-X-Name-Last: Vehtari Title: What are the Most Important Statistical Ideas of the Past 50 Years? Abstract: We review the most important statistical ideas of the past half century, which we categorize as: counterfactual causal inference, bootstrapping and simulation-based inference, overparameterized models and regularization, Bayesian multilevel models, generic computation algorithms, adaptive decision analysis, robust inference, and exploratory data analysis. We discuss key contributions in these subfields, how they relate to modern computing and big data, and how they might be developed and extended in future decades. The goal of this article is to provoke thought and discussion regarding the larger themes of research in statistics and data science. Journal: Journal of the American Statistical Association Pages: 2087-2097 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1938081 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1938081 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:2087-2097 Template-Type: ReDIF-Article 1.0 Author-Name: Munir Hiabu Author-X-Name-First: Munir Author-X-Name-Last: Hiabu Author-Name: Enno Mammen Author-X-Name-First: Enno Author-X-Name-Last: Mammen Author-Name: M. Dolores Martínez-Miranda Author-X-Name-First: M. Dolores Author-X-Name-Last: Martínez-Miranda Author-Name: Jens P. Nielsen Author-X-Name-First: Jens P. Author-X-Name-Last: Nielsen Title: Smooth Backfitting of Proportional Hazards With Multiplicative Components Abstract: Smooth backfitting has proven to have a number of theoretical and practical advantages in structured regression. By projecting the data down onto the structured space of interest, smooth backfitting provides a direct link between data and estimator. This article introduces the ideas of smooth backfitting to survival analysis in a proportional hazard model, where we assume an underlying conditional hazard with multiplicative components. We develop asymptotic theory for the estimator. In a comprehensive simulation study, we show that our smooth backfitting estimator successfully circumvents the curse of dimensionality and outperforms existing estimators. This is especially the case in difficult situations, such as a high number of covariates and/or high correlation between the covariates, where other estimators tend to break down. We use the smooth backfitter in a practical application where we extend recent advances in in-sample forecasting methodology by allowing more information to be incorporated, while still obeying the structured requirements of in-sample forecasting. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 1983-1993 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2020.1753520 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1753520 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1983-1993 Template-Type: ReDIF-Article 1.0 Author-Name: Guillaume Gerber Author-X-Name-First: Guillaume Author-X-Name-Last: Gerber Author-Name: Yohann Le Faou Author-X-Name-First: Yohann Author-X-Name-Last: Le Faou Author-Name: Olivier Lopez Author-X-Name-First: Olivier Author-X-Name-Last: Lopez Author-Name: Michael Trupin Author-X-Name-First: Michael Author-X-Name-Last: Trupin Title: The Impact of Churn on Client Value in Health Insurance, Evaluation Using a Random Forest Under Various Censoring Mechanisms Abstract: In the insurance broker market, commissions received by brokers are closely related to so-called “customer value”: the longer a policyholder keeps their contract, the more profit there is for the company and therefore the broker. Hence, predicting the time at which a potential policyholder will surrender their contract is essential to optimizing a commercial process and defining a prospect score. In this article, we propose a weighted random forest model to address this problem. Our model is designed to compensate for the impact of random censoring. We investigate different types of assumptions on the censoring, studying both the case where it is independent of the covariates and the case where it is not. We compare our approach with other standard methods that apply in our setting, using simulated and real data analysis. We show that our approach is very competitive in terms of quadratic error in addressing the given problem. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2053-2064 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2020.1764364 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1764364 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:2053-2064 Template-Type: ReDIF-Article 1.0 Author-Name: Jyotishka Datta Author-X-Name-First: Jyotishka Author-X-Name-Last: Datta Author-Name: Bhramar Mukherjee Author-X-Name-First: Bhramar Author-X-Name-Last: Mukherjee Title: Discussion on “Regression Models for Understanding COVID-19 Epidemic Dynamics With Incomplete Data” Journal: Journal of the American Statistical Association Pages: 1583-1586 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1982721 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1982721 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1583-1586 Template-Type: ReDIF-Article 1.0 Author-Name: Huazhang Li Author-X-Name-First: Huazhang Author-X-Name-Last: Li Author-Name: Yaotian Wang Author-X-Name-First: Yaotian Author-X-Name-Last: Wang Author-Name: Guofen Yan Author-X-Name-First: Guofen Author-X-Name-Last: Yan Author-Name: Yinge Sun Author-X-Name-First: Yinge Author-X-Name-Last: Sun Author-Name: Seiji Tanabe Author-X-Name-First: Seiji Author-X-Name-Last: Tanabe Author-Name: Chang-Chia Liu Author-X-Name-First: Chang-Chia Author-X-Name-Last: Liu Author-Name: Mark S. Quigg Author-X-Name-First: Mark S.
Author-X-Name-Last: Quigg Author-Name: Tingting Zhang Author-X-Name-First: Tingting Author-X-Name-Last: Zhang Title: A Bayesian State-Space Approach to Mapping Directional Brain Networks Abstract: The human brain is a directional network system of brain regions involving directional connectivity. Seizures are a directional network phenomenon as abnormal neuronal activities start from a seizure onset zone (SOZ) and propagate to otherwise healthy regions. To localize the SOZ of an epileptic patient, clinicians use intracranial electroencephalography (iEEG) to record the patient’s intracranial brain activity in many small regions. iEEG data are high-dimensional multivariate time series. We build a state-space multivariate autoregression (SSMAR) for iEEG data to model the underlying directional brain network. To produce scientifically interpretable network results, we incorporate into the SSMAR the scientific knowledge that the underlying brain network tends to have a cluster structure. Specifically, we assign to the SSMAR parameters a stochastic-blockmodel-motivated prior, which reflects the cluster structure. We develop a Bayesian framework to estimate the SSMAR, infer directional connections, and identify clusters for the unobserved network edges. The new method is robust to violations of model assumptions and outperforms existing network methods. By applying the new method to an epileptic patient’s iEEG data, we reveal seizure initiation and propagation in the patient’s directional brain network and discover a unique directional connectivity property of the SOZ. Overall, the network results obtained in this study bring new insights into epileptic patients’ normal and abnormal epileptic brain mechanisms and have the potential to assist neurologists and clinicians in localizing the SOZ—a long-standing research focus in epilepsy diagnosis and treatment. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1637-1647 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2020.1865985 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1865985 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1637-1647 Template-Type: ReDIF-Article 1.0 Author-Name: Pierre Lafaye de Micheaux Author-X-Name-First: Pierre Lafaye Author-X-Name-Last: de Micheaux Author-Name: Pavlo Mozharovskyi Author-X-Name-First: Pavlo Author-X-Name-Last: Mozharovskyi Author-Name: Myriam Vimond Author-X-Name-First: Myriam Author-X-Name-Last: Vimond Title: Depth for Curve Data and Applications Abstract: In 1975, John W. Tukey defined statistical data depth as a function that determines the centrality of an arbitrary point with respect to a data cloud or to a probability measure. Over the past several decades, this seminal idea of data depth has evolved into a powerful tool that has proved useful in various fields of science. Recently, extending the notion of data depth to the functional setting has attracted considerable attention among theoretical and applied statisticians. We go further and suggest a notion of data depth suitable for data represented as curves, or trajectories, which is independent of the parameterization. We show that our curve depth satisfies theoretical requirements of general depth functions that are meaningful for trajectories.
We apply our methodology to diffusion tensor brain images and also to pattern recognition of handwritten digits and letters. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1881-1897 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2020.1745815 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1745815 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1881-1897 Template-Type: ReDIF-Article 1.0 Author-Name: Bruno Ferman Author-X-Name-First: Bruno Author-X-Name-Last: Ferman Title: On the Properties of the Synthetic Control Estimator with Many Periods and Many Controls Abstract: We consider the asymptotic properties of the synthetic control (SC) estimator when both the number of pretreatment periods and control units are large. If potential outcomes follow a linear factor model, we provide conditions under which the SC unit asymptotically recovers the factor structure of the treated unit, even when the pretreatment fit is imperfect. This happens when there are weights diluted among an increasing number of control units such that a weighted average of the factor structure of the control units asymptotically reconstructs the factor structure of the treated unit. In this case, the SC estimator is asymptotically unbiased even when treatment assignment is correlated with time-varying unobservables. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1764-1772 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1965613 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1965613 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1764-1772 Template-Type: ReDIF-Article 1.0 Author-Name: Anish Agarwal Author-X-Name-First: Anish Author-X-Name-Last: Agarwal Author-Name: Devavrat Shah Author-X-Name-First: Devavrat Author-X-Name-Last: Shah Author-Name: Dennis Shen Author-X-Name-First: Dennis Author-X-Name-Last: Shen Author-Name: Dogyoon Song Author-X-Name-First: Dogyoon Author-X-Name-Last: Song Title: On Robustness of Principal Component Regression Abstract: Principal component regression (PCR) is a simple, but powerful and ubiquitously utilized method. Its effectiveness is well established when the covariates exhibit low-rank structure. However, its ability to handle settings with noisy, missing, and mixed-valued, that is, discrete and continuous, covariates is not understood and remains an important open challenge. As the main contribution of this work, we establish the robustness of PCR, without any change, in this respect and provide meaningful finite-sample analysis. To do so, we establish that PCR is equivalent to performing linear regression after preprocessing the covariate matrix via hard singular value thresholding (HSVT). As a result, in the context of counterfactual analysis using observational data, we show PCR is equivalent to the recently proposed robust variant of the synthetic control method, known as robust synthetic control (RSC). As an immediate consequence, we obtain finite-sample analysis of the RSC estimator that was previously absent. 
As an important contribution to the synthetic controls literature, we establish that an (approximate) linear synthetic control exists in the setting of a generalized factor model, or latent variable model; traditionally in the literature, the existence of a synthetic control has to be assumed as an axiom. We further discuss a surprising implication of the robustness property of PCR with respect to noise: PCR can learn a good predictive model even if the covariates are tactfully transformed to preserve differential privacy. Finally, this work advances the state-of-the-art analysis for HSVT by establishing stronger guarantees with respect to the l2,∞-norm rather than the Frobenius norm as is commonly done in the matrix estimation literature, which may be of interest in its own right. Journal: Journal of the American Statistical Association Pages: 1731-1745 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1928513 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1928513 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1731-1745 Template-Type: ReDIF-Article 1.0 Author-Name: Jincheng Zhou Author-X-Name-First: Jincheng Author-X-Name-Last: Zhou Author-Name: James S. Hodges Author-X-Name-First: James S. Author-X-Name-Last: Hodges Author-Name: Haitao Chu Author-X-Name-First: Haitao Author-X-Name-Last: Chu Title: A Bayesian Hierarchical CACE Model Accounting for Incomplete Noncompliance With Application to a Meta-analysis of Epidural Analgesia on Cesarean Section Abstract: Noncompliance with assigned treatments is a common challenge in analyzing and interpreting randomized clinical trials (RCTs). One way to handle noncompliance is to estimate the complier-average causal effect (CACE), the intervention’s efficacy in the subpopulation that complies with assigned treatment. In a two-step meta-analysis, one could first estimate CACE for each study, then combine them to estimate the population-averaged CACE. However, when some trials do not report noncompliance data, the two-step meta-analysis can be less efficient and potentially biased by excluding these trials. This article proposes a flexible Bayesian hierarchical CACE framework to simultaneously account for heterogeneous and incomplete noncompliance data in a meta-analysis of RCTs. The models are motivated by and used for a meta-analysis estimating the CACE of epidural analgesia on cesarean section, in which only 10 of 27 trials reported complete noncompliance data. The new analysis includes all 27 studies and the results present new insights on the causal effect after accounting for noncompliance. Compared to the estimated risk difference of 0.8% (95% CI: –0.3%, 1.9%) given by the two-step intention-to-treat meta-analysis, the estimated CACE is 4.1% (95% CrI: –0.3%, 10.5%). We also report simulation studies to evaluate the performance of the proposed method. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1700-1712 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1900859 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1900859 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1700-1712 Template-Type: ReDIF-Article 1.0 Author-Name: Xinyi Li Author-X-Name-First: Xinyi Author-X-Name-Last: Li Author-Name: Li Wang Author-X-Name-First: Li Author-X-Name-Last: Wang Author-Name: Huixia Judy Wang Author-X-Name-First: Huixia Judy Author-X-Name-Last: Wang Title: Sparse Learning and Structure Identification for Ultrahigh-Dimensional Image-on-Scalar Regression Abstract: This article considers high-dimensional image-on-scalar regression, where the spatial heterogeneity of covariate effects on imaging responses is investigated via a flexible partially linear spatially varying coefficient model. To tackle the challenges of spatial smoothing over the imaging response’s complex domain consisting of regions of interest, we approximate the spatially varying coefficient functions via bivariate spline functions over triangulation. We first study estimation when the active constant coefficients and varying coefficient functions are known in advance. We then further develop a unified approach for simultaneous sparse learning and model structure identification in the presence of ultrahigh-dimensional covariates. Our method can identify zero, nonzero constant, and spatially varying components correctly and efficiently. The estimators of the constant coefficients and the varying coefficient functions are consistent, and the constant coefficient estimators are asymptotically normal. The method is evaluated by Monte Carlo simulation studies and applied to a dataset provided by the Alzheimer’s Disease Neuroimaging Initiative. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1994-2008 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2020.1753523 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1753523 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1994-2008 Template-Type: ReDIF-Article 1.0 Author-Name: Eli Ben-Michael Author-X-Name-First: Eli Author-X-Name-Last: Ben-Michael Author-Name: Avi Feller Author-X-Name-First: Avi Author-X-Name-Last: Feller Author-Name: Jesse Rothstein Author-X-Name-First: Jesse Author-X-Name-Last: Rothstein Title: The Augmented Synthetic Control Method Abstract: The synthetic control method (SCM) is a popular approach for estimating the impact of a treatment on a single unit in panel data settings. The “synthetic control” is a weighted average of control units that balances the treated unit’s pretreatment outcomes and other covariates as closely as possible. A critical feature of the original proposal is to use SCM only when the fit on pretreatment outcomes is excellent. We propose Augmented SCM as an extension of SCM to settings where such pretreatment fit is infeasible. Analogous to bias correction for inexact matching, augmented SCM uses an outcome model to estimate the bias due to imperfect pretreatment fit and then de-biases the original SCM estimate. Our main proposal, which uses ridge regression as the outcome model, directly controls pretreatment fit while minimizing extrapolation from the convex hull. This estimator can also be expressed as a solution to a modified synthetic controls problem that allows negative weights on some donor units.
We bound the estimation error of this approach under different data-generating processes, including a linear factor model, and show how regularization helps to avoid over-fitting to noise. We demonstrate gains from Augmented SCM with extensive simulation studies and apply this framework to estimate the impact of the 2012 Kansas tax cuts on economic growth. We implement the proposed method in the new augsynth R package. Journal: Journal of the American Statistical Association Pages: 1789-1803 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1929245 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1929245 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1789-1803 Template-Type: ReDIF-Article 1.0 Author-Name: Fei Xue Author-X-Name-First: Fei Author-X-Name-Last: Xue Author-Name: Annie Qu Author-X-Name-First: Annie Author-X-Name-Last: Qu Title: Integrating Multisource Block-Wise Missing Data in Model Selection Abstract: For multisource data, blocks of variable information from certain sources are likely missing. Existing methods for handling missing data do not take structures of block-wise missing data into consideration. In this article, we propose a multiple block-wise imputation (MBI) approach, which incorporates imputations based on both complete and incomplete observations. Specifically, for a given missing pattern group, the imputations in MBI incorporate more samples from groups with fewer observed variables in addition to the group with complete observations. We propose to construct estimating equations based on all available information, and integrate informative estimating functions to achieve efficient estimators. We show that the proposed method has estimation and model selection consistency under both fixed-dimensional and high-dimensional settings. Moreover, the proposed estimator is asymptotically more efficient than the estimator based on a single imputation from complete observations only. In addition, the proposed method is not restricted to data that are missing completely at random. Numerical studies and an ADNI data application confirm that the proposed method outperforms existing variable selection methods under various missing mechanisms. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1914-1927 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2020.1751176 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1751176 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1914-1927 Template-Type: ReDIF-Article 1.0 Author-Name: Nicholas P. Jewell Author-X-Name-First: Nicholas P. Author-X-Name-Last: Jewell Title: Statistical Models for COVID-19 Incidence, Cumulative Prevalence, and Rt Journal: Journal of the American Statistical Association Pages: 1578-1582 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1983436 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1983436 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1578-1582 Template-Type: ReDIF-Article 1.0 Author-Name: Jason Wu Author-X-Name-First: Jason Author-X-Name-Last: Wu Author-Name: Peng Ding Author-X-Name-First: Peng Author-X-Name-Last: Ding Title: Randomization Tests for Weak Null Hypotheses in Randomized Experiments Abstract: The Fisher randomization test (FRT) is appropriate for any test statistic, under a sharp null hypothesis that can recover all missing potential outcomes. However, one often wishes to test a weak null hypothesis that the treatment does not affect the units on average. To use the FRT for a weak null hypothesis, we must address two issues. First, we need to impute the missing potential outcomes although the weak null hypothesis cannot determine all of them. Second, we need to choose a proper test statistic. For a general weak null hypothesis, we propose an approach to imputing missing potential outcomes under a compatible sharp null hypothesis. Building on this imputation scheme, we advocate a studentized statistic. The resulting FRT has multiple desirable features. First, it is model-free. Second, it is finite-sample exact under the sharp null hypothesis that we use to impute the potential outcomes. Third, it conservatively controls large-sample Type I error under the weak null hypothesis of interest. Therefore, our FRT is agnostic to treatment effect heterogeneity. We establish a unified theory for general factorial experiments and extend it to stratified and clustered experiments. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1898-1913 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2020.1750415 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1750415 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1898-1913 Template-Type: ReDIF-Article 1.0 Author-Name: Alberto Abadie Author-X-Name-First: Alberto Author-X-Name-Last: Abadie Author-Name: Matias D. Cattaneo Author-X-Name-First: Matias D. Author-X-Name-Last: Cattaneo Title: Introduction to the Special Section on Synthetic Control Methods Journal: Journal of the American Statistical Association Pages: 1713-1715 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.2002600 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2002600 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1713-1715 Template-Type: ReDIF-Article 1.0 Author-Name: Sourav Chatterjee Author-X-Name-First: Sourav Author-X-Name-Last: Chatterjee Title: A New Coefficient of Correlation Abstract: Is it possible to define a coefficient of correlation which is (a) as simple as the classical coefficients like Pearson’s correlation or Spearman’s correlation, and yet (b) consistently estimates some simple and interpretable measure of the degree of dependence between the variables, which is 0 if and only if the variables are independent and 1 if and only if one is a measurable function of the other, and (c) has a simple asymptotic theory under the hypothesis of independence, like the classical coefficients? This article answers this question in the affirmative, by producing such a coefficient. No assumptions are needed on the distributions of the variables.
There are several coefficients in the literature that converge to 0 if and only if the variables are independent, but none that satisfy any of the other properties mentioned above. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2009-2022 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2020.1758115 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1758115 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:2009-2022 Template-Type: ReDIF-Article 1.0 Author-Name: Simón Lunagómez Author-X-Name-First: Simón Author-X-Name-Last: Lunagómez Author-Name: Sofia C. Olhede Author-X-Name-First: Sofia C. Author-X-Name-Last: Olhede Author-Name: Patrick J. Wolfe Author-X-Name-First: Patrick J. Author-X-Name-Last: Wolfe Title: Modeling Network Populations via Graph Distances Abstract: This article introduces a new class of models for multiple networks. The core idea is to parameterize a distribution on labeled graphs in terms of a Fréchet mean graph (which depends on a user-specified choice of metric or graph distance) and a parameter that controls the concentration of this distribution about its mean. Entropy is the natural parameter for such control, varying from a point mass concentrated on the Fréchet mean itself to a uniform distribution over all graphs on a given vertex set. We provide a hierarchical Bayesian approach for exploiting this construction, along with straightforward strategies for sampling from the resultant posterior distribution. We conclude by demonstrating the efficacy of our approach via simulation studies and two multiple-network data analysis examples: one drawn from systems biology and the other from neuroscience. This article has online supplementary materials. Journal: Journal of the American Statistical Association Pages: 2023-2040 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2020.1763803 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1763803 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:2023-2040 Template-Type: ReDIF-Article 1.0 Author-Name: Colin B. Fogarty Author-X-Name-First: Colin B. Author-X-Name-Last: Fogarty Author-Name: Kwonsang Lee Author-X-Name-First: Kwonsang Author-X-Name-Last: Lee Author-Name: Rachel R. Kelz Author-X-Name-First: Rachel R. Author-X-Name-Last: Kelz Author-Name: Luke J. Keele Author-X-Name-First: Luke J. Author-X-Name-Last: Keele Title: Biased Encouragements and Heterogeneous Effects in an Instrumental Variable Study of Emergency General Surgical Outcomes Abstract: We investigate the efficacy of surgical versus nonsurgical management for two gastrointestinal conditions, colitis and diverticulitis, using observational data. We deploy an instrumental variable design with surgeons’ tendencies to operate as an instrument. Assuming instrument validity, we find that nonsurgical alternatives can reduce both hospital length of stay and the risk of complications, with estimated effects larger for septic patients than for nonseptic patients. The validity of our instrument is plausible but not ironclad, necessitating a sensitivity analysis. Existing sensitivity analyses for IV designs assume effect homogeneity, unlikely to hold here because of patient-specific physiology. 
We develop a new sensitivity analysis that accommodates arbitrary effect heterogeneity and exploits components explainable by observed features. We find that the results for nonseptic patients prove more robust to hidden bias despite having smaller estimated effects. For nonseptic patients, two individuals with identical observed characteristics would have to differ in their odds of assignment to a high-tendency-to-operate surgeon by a factor of 2.34 to overturn our finding of a benefit for nonsurgical management in reducing length of stay. For septic patients, this value is only 1.64. Simulations illustrate that this phenomenon may be explained by differences in within-group heterogeneity. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1625-1636 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2020.1863220 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1863220 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1625-1636 Template-Type: ReDIF-Article 1.0 Author-Name: Victor Chernozhukov Author-X-Name-First: Victor Author-X-Name-Last: Chernozhukov Author-Name: Kaspar Wüthrich Author-X-Name-First: Kaspar Author-X-Name-Last: Wüthrich Author-Name: Yinchu Zhu Author-X-Name-First: Yinchu Author-X-Name-Last: Zhu Title: An Exact and Robust Conformal Inference Method for Counterfactual and Synthetic Controls Abstract: We introduce new inference procedures for counterfactual and synthetic control methods for policy evaluation. We recast the causal inference problem as a counterfactual prediction and a structural breaks testing problem. This allows us to exploit insights from conformal prediction and structural breaks testing to develop permutation inference procedures that accommodate modern high-dimensional estimators, are valid under weak and easy-to-verify conditions, and are provably robust against misspecification. Our methods work in conjunction with many different approaches for predicting counterfactual mean outcomes in the absence of the policy intervention. Examples include synthetic controls, difference-in-differences, factor and matrix completion models, and (fused) time series panel data models. Our approach demonstrates excellent small-sample performance in simulations and is taken to a data application where we re-evaluate the consequences of decriminalizing indoor prostitution. Open-source software for implementing our conformal inference methods is available. Journal: Journal of the American Statistical Association Pages: 1849-1864 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1920957 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1920957 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1849-1864 Template-Type: ReDIF-Article 1.0 Author-Name: Zhigang Li Author-X-Name-First: Zhigang Author-X-Name-Last: Li Author-Name: Lu Tian Author-X-Name-First: Lu Author-X-Name-Last: Tian Author-Name: A. James O’Malley Author-X-Name-First: A. James Author-X-Name-Last: O’Malley Author-Name: Margaret R. Karagas Author-X-Name-First: Margaret R. Author-X-Name-Last: Karagas Author-Name: Anne G. Hoen Author-X-Name-First: Anne G. Author-X-Name-Last: Hoen Author-Name: Brock C.
Christensen Author-X-Name-First: Brock C. Author-X-Name-Last: Christensen Author-Name: Juliette C. Madan Author-X-Name-First: Juliette C. Author-X-Name-Last: Madan Author-Name: Quran Wu Author-X-Name-First: Quran Author-X-Name-Last: Wu Author-Name: Raad Z. Gharaibeh Author-X-Name-First: Raad Z. Author-X-Name-Last: Gharaibeh Author-Name: Christian Jobin Author-X-Name-First: Christian Author-X-Name-Last: Jobin Author-Name: Hongzhe Li Author-X-Name-First: Hongzhe Author-X-Name-Last: Li Title: IFAA: Robust Association Identification and Inference for Absolute Abundance in Microbiome Analyses Abstract: The target of inference in microbiome analyses is usually relative abundance (RA) because RA in a sample (e.g., stool) can be considered an approximation of RA in an entire ecosystem (e.g., gut). However, inference on RA suffers from the fact that RAs are calculated by dividing absolute abundances (AAs) by a common denominator (CD), the sum of all AAs (i.e., the library size). Consequently, a perturbation in one taxon changes the CD and thus induces spurious changes in the RA of all other taxa, and those spurious changes can lead to false positive or false negative findings. We propose a novel analysis approach (IFAA) to make robust inference on the AA of an ecosystem that circumvents the issues induced by the CD problem and the compositional structure of RA. IFAA can also address the issue of overdispersion and handle zero-inflated data structures. IFAA identifies microbial taxa associated with the covariates in Phase 1 and estimates the association parameters by employing an independent reference taxon in Phase 2. Two real data applications are presented and extensive simulations show that IFAA outperforms established existing approaches by a large margin in the presence of unbalanced library sizes. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1595-1608 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2020.1860770 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1860770 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1595-1608 Template-Type: ReDIF-Article 1.0 Author-Name: Alberto Abadie Author-X-Name-First: Alberto Author-X-Name-Last: Abadie Author-Name: Jérémy L’Hour Author-X-Name-First: Jérémy Author-X-Name-Last: L’Hour Title: A Penalized Synthetic Control Estimator for Disaggregated Data Abstract: Synthetic control methods are commonly applied in empirical research to estimate the effects of treatments or interventions on aggregate outcomes. A synthetic control estimator compares the outcome of a treated unit to the outcome of a weighted average of untreated units that best resembles the characteristics of the treated unit before the intervention. When disaggregated data are available, constructing separate synthetic controls for each treated unit may help avoid interpolation biases. However, the problem of finding a synthetic control that best reproduces the characteristics of a treated unit may not have a unique solution. Multiplicity of solutions is a particularly daunting challenge when the data include many treated and untreated units. To address this challenge, we propose a synthetic control estimator that penalizes the pairwise discrepancies between the characteristics of the treated units and the characteristics of the units that contribute to their synthetic controls.
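A minimal Python sketch of the penalized objective just described; the variable names are ours, and the article's data-driven choices of the penalization parameter are not reproduced.

    import numpy as np
    from scipy.optimize import minimize

    def penalized_scm_weights(X1, X0, lam):
        """Penalized synthetic control weights: minimize
        ||X1 - X0 @ w||^2 + lam * sum_j w_j * ||X1 - X0[:, j]||^2
        over the simplex {w >= 0, sum(w) = 1}, where X1 holds the treated
        unit's characteristics and the columns of X0 those of the donors."""
        J = X0.shape[1]
        pair = ((X0 - X1[:, None]) ** 2).sum(axis=0)   # per-donor discrepancies
        objective = lambda w: ((X1 - X0 @ w) ** 2).sum() + lam * pair @ w
        res = minimize(objective, np.full(J, 1.0 / J),
                       bounds=[(0.0, 1.0)] * J,
                       constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
        return res.x

As lam grows, the penalty concentrates all weight on the nearest donor, so the estimator moves from a pure synthetic control toward pairwise matching.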
The penalization parameter trades off pairwise matching discrepancies with respect to the characteristics of each unit in the synthetic control against matching discrepancies with respect to the characteristics of the synthetic control unit as a whole. We study the properties of this estimator and propose data-driven choices of the penalization parameter. Journal: Journal of the American Statistical Association Pages: 1817-1834 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1971535 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1971535 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1817-1834 Template-Type: ReDIF-Article 1.0 Author-Name: Corbin Quick Author-X-Name-First: Corbin Author-X-Name-Last: Quick Author-Name: Rounak Dey Author-X-Name-First: Rounak Author-X-Name-Last: Dey Author-Name: Xihong Lin Author-X-Name-First: Xihong Author-X-Name-Last: Lin Title: Rejoinder: Regression Models for Understanding COVID-19 Epidemic Dynamics With Incomplete Data Journal: Journal of the American Statistical Association Pages: 1591-1594 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.2001340 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2001340 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1591-1594 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Correction Journal: Journal of the American Statistical Association Pages: 2100-2100 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1969237 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1969237 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:2100-2100 Template-Type: ReDIF-Article 1.0 Author-Name: Ricardo Masini Author-X-Name-First: Ricardo Author-X-Name-Last: Masini Author-Name: Marcelo C. Medeiros Author-X-Name-First: Marcelo C. Author-X-Name-Last: Medeiros Title: Counterfactual Analysis With Artificial Controls: Inference, High Dimensions, and Nonstationarity Abstract: Recently, there has been growing interest in developing statistical tools to conduct counterfactual analysis with aggregate data when a single “treated” unit suffers an intervention, such as a policy change, and there is no obvious control group. Usually, the proposed methods are based on the construction of an artificial counterfactual from a pool of “untreated” peers, organized in a panel data structure. In this article, we consider a general framework for counterfactual analysis for high-dimensional, nonstationary data with deterministic and/or stochastic trends, which nests well-established methods, such as the synthetic control. We propose a resampling procedure to test intervention effects that does not rely on postintervention asymptotics and that can be used even if there is only a single observation after the intervention. A simulation study is provided as well as an empirical application. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1773-1788 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1964978 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1964978 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1773-1788 Template-Type: ReDIF-Article 1.0 Author-Name: Dingdong Yi Author-X-Name-First: Dingdong Author-X-Name-Last: Yi Author-Name: Shaoyang Ning Author-X-Name-First: Shaoyang Author-X-Name-Last: Ning Author-Name: Chia-Jung Chang Author-X-Name-First: Chia-Jung Author-X-Name-Last: Chang Author-Name: S. C. Kou Author-X-Name-First: S. C. Author-X-Name-Last: Kou Title: Forecasting Unemployment Using Internet Search Data via PRISM Abstract: Big data generated from the Internet offer great potential for predictive analysis. Here we focus on using online users’ Internet search data to forecast unemployment initial claims weeks into the future, which provides timely insights into the direction of the economy. To this end, we present a novel method Penalized Regression with Inferred Seasonality Module (PRISM), which uses publicly available online search data from Google. PRISM is a semiparametric method, motivated by a general state-space formulation, and employs nonparametric seasonal decomposition and penalized regression. For forecasting unemployment initial claims, PRISM outperforms all previously available methods, including forecasting during the 2008–2009 financial crisis period and near-future forecasting during the COVID-19 pandemic period, in both of which unemployment initial claims rose rapidly. The timely and accurate unemployment forecasts by PRISM could aid government agencies and financial institutions in assessing economic trends and making well-informed decisions, especially in the face of economic turbulence. Journal: Journal of the American Statistical Association Pages: 1662-1673 Issue: 536 Volume: 116 Year: 2021 Month: 10 X-DOI: 10.1080/01621459.2021.1883436 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1883436 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1662-1673 Template-Type: ReDIF-Article 1.0 Author-Name: Xianyang Zhang Author-X-Name-First: Xianyang Author-X-Name-Last: Zhang Author-Name: Jun Chen Author-X-Name-First: Jun Author-X-Name-Last: Chen Title: Covariate Adaptive False Discovery Rate Control With Applications to Omics-Wide Multiple Testing Abstract: Conventional multiple testing procedures often assume hypotheses for different features are exchangeable. However, in many scientific applications, additional covariate information regarding the patterns of signals and nulls is available. In this article, we introduce an FDR control procedure for large-scale inference problems that can incorporate covariate information. We develop a fast algorithm to implement the proposed procedure and prove its asymptotic validity even when the underlying likelihood ratio model is misspecified and the p-values are weakly dependent (e.g., strong mixing). Extensive simulations are conducted to study the finite sample performance of the proposed method and we demonstrate that the new approach improves over the state-of-the-art approaches by being flexible, robust, powerful, and computationally efficient. We finally apply the method to several omics datasets arising from genomics studies with the aim of identifying omics features associated with some clinical and biological phenotypes. We show that the method is overall the most powerful among competing methods, especially when the signal is sparse. The proposed covariate adaptive multiple testing procedure is implemented in the R package CAMT.
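As background for how covariates can enter an FDR procedure at all, the Python sketch below shows the standard covariate-weighted Benjamini–Hochberg device; this is a generic illustration, not the likelihood-ratio-based CAMT algorithm, and the construction of the weights from covariates is left abstract.

    import numpy as np

    def weighted_bh(pvals, weights, alpha=0.05):
        """Covariate-weighted Benjamini-Hochberg step-up procedure: hypotheses
        deemed more promising (weight > 1) face a relaxed threshold; weights
        are normalized to average 1 to preserve FDR control."""
        p = np.asarray(pvals, dtype=float)
        w = np.asarray(weights, dtype=float)
        q = p / (w / w.mean())                       # weight-adjusted p-values
        m = len(q)
        order = np.argsort(q)
        ok = q[order] <= alpha * np.arange(1, m + 1) / m
        k = ok.nonzero()[0].max() + 1 if ok.any() else 0
        reject = np.zeros(m, dtype=bool)
        reject[order[:k]] = True
        return reject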
Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 411-427 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1783273 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1783273 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:411-427 Template-Type: ReDIF-Article 1.0 Author-Name: Holger Dette Author-X-Name-First: Holger Author-X-Name-Last: Dette Author-Name: Guangming Pan Author-X-Name-First: Guangming Author-X-Name-Last: Pan Author-Name: Qing Yang Author-X-Name-First: Qing Author-X-Name-Last: Yang Title: Estimating a Change Point in a Sequence of Very High-Dimensional Covariance Matrices Abstract: This article considers the problem of estimating a change point in the covariance matrix in a sequence of high-dimensional vectors, where the dimension is substantially larger than the sample size. A two-stage approach is proposed to efficiently estimate the location of the change point. The first step consists of a reduction of the dimension to identify elements of the covariance matrices corresponding to significant changes. In a second step, we use the components after dimension reduction to determine the position of the change point. Theoretical properties are developed for both steps, and numerical studies are conducted to support the new methodology. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 444-454 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1785477 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1785477 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:444-454 Template-Type: ReDIF-Article 1.0 Author-Name: Bei Jiang Author-X-Name-First: Bei Author-X-Name-Last: Jiang Author-Name: Adrian E. Raftery Author-X-Name-First: Adrian E. Author-X-Name-Last: Raftery Author-Name: Russell J. Steele Author-X-Name-First: Russell J. Author-X-Name-Last: Steele Author-Name: Naisyin Wang Author-X-Name-First: Naisyin Author-X-Name-Last: Wang Title: Balancing Inferential Integrity and Disclosure Risk Via Model Targeted Masking and Multiple Imputation Abstract: There is a growing expectation that data collected by government-funded studies should be openly available to ensure research reproducibility, which also increases concerns about data privacy. A strategy to protect individuals’ identity is to release multiply imputed (MI) synthetic datasets with masked sensitivity values. However, information loss or incorrectly specified imputation models can weaken or invalidate the inferences obtained from the MI-datasets. We propose a new masking framework with a data-augmentation (DA) component and a tuning mechanism that balances protecting identity disclosure against preserving data utility. Applying it to a restricted-use Canadian Scleroderma Research Group (CSRG) dataset, we found that this DA-MI strategy achieved a 0% identity disclosure risk and preserved all inferential conclusions. It yielded 95% confidence intervals (CIs) that had overlaps of 98.5% (95.5%) on average with the CIs constructed using the full, unmasked CSRG dataset in a work-disability (interstitial lung disease) study. 
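The CI-overlap percentages quoted above can be computed with a symmetric interval-overlap measure; the variant below (the average of the overlap as a fraction of each interval's length) is a common choice, though whether the article uses exactly this form is an assumption on our part.

    def ci_overlap(l1, u1, l2, u2):
        """Symmetric overlap of confidence intervals [l1, u1] and [l2, u2]:
        1.0 for identical intervals, 0.0 for disjoint ones."""
        inter = max(0.0, min(u1, u2) - max(l1, l2))
        return 0.5 * (inter / (u1 - l1) + inter / (u2 - l2))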
The CI-overlaps were lower for several other methods considered, ranging from 73.9% to 91.9% on average with the lowest value being 28.1%; such low CI-overlaps further led to some incorrect inferential conclusions. These findings indicate that the DA-MI masking framework facilitates sharing of useful research data while protecting participants’ identities. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 52-66 Issue: 537 Volume: 117 Year: 2021 Month: 5 X-DOI: 10.1080/01621459.2021.1909597 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1909597 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2021:i:537:p:52-66 Template-Type: ReDIF-Article 1.0 Author-Name: Rong Chen Author-X-Name-First: Rong Author-X-Name-Last: Chen Author-Name: Dan Yang Author-X-Name-First: Dan Author-X-Name-Last: Yang Author-Name: Cun-Hui Zhang Author-X-Name-First: Cun-Hui Author-X-Name-Last: Zhang Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 128-132 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2022.2035099 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2035099 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:128-132 Template-Type: ReDIF-Article 1.0 Author-Name: Adam Ciarleglio Author-X-Name-First: Adam Author-X-Name-Last: Ciarleglio Author-Name: Eva Petkova Author-X-Name-First: Eva Author-X-Name-Last: Petkova Author-Name: Ofer Harel Author-X-Name-First: Ofer Author-X-Name-Last: Harel Title: Elucidating Age and Sex-Dependent Association Between Frontal EEG Asymmetry and Depression: An Application of Multiple Imputation in Functional Regression Abstract: Frontal power asymmetry (FA), a measure of brain function derived from electroencephalography, is a potential biomarker for major depressive disorder (MDD). Though FA is functional in nature, it is typically reduced to a scalar value prior to analysis, possibly obscuring its relationship with MDD and leading to a number of studies that have provided contradictory results. To overcome this issue, we sought to fit a functional regression model to characterize the association between FA and MDD status, adjusting for age, sex, cognitive ability, and handedness using data from a large clinical study that included both MDD and healthy control (HC) subjects. Since nearly 40% of the observations are missing data on either FA or cognitive ability, we propose an extension of multiple imputation (MI) by chained equations that allows for the imputation of both scalar and functional data. We also propose an extension of Rubin’s Rules for conducting valid inference in this setting. The proposed methods are evaluated in a simulation and applied to our FA data. For our FA data, a pooled analysis from the imputed datasets yielded similar results to those of the complete case analysis. We found that, among young females, HCs tended to have higher FA over the θ, α, and β frequency bands, but that the difference between HC and MDD subjects diminishes and ultimately reverses with age. For males, HCs tended to have higher FA in the β frequency band, regardless of age. 
Young male HCs had higher FA in the θ and α bands, but this difference diminishes with increasing age in the α band and ultimately reverses with increasing age in the θ band. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 12-26 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2021.1942011 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1942011 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:12-26 Template-Type: ReDIF-Article 1.0 Author-Name: Kori Khan Author-X-Name-First: Kori Author-X-Name-Last: Khan Author-Name: Catherine A. Calder Author-X-Name-First: Catherine A. Author-X-Name-Last: Calder Title: Restricted Spatial Regression Methods: Implications for Inference Abstract: The issue of spatial confounding between the spatial random effect and the fixed effects in regression analyses has been identified as a concern in the statistical literature. Multiple authors have offered perspectives and potential solutions. In this article, for the areal spatial data setting, we show that many of the methods designed to alleviate spatial confounding can be viewed as special cases of a general class of models. We refer to this class as restricted spatial regression (RSR) models, extending terminology currently in use. We offer a mathematically based exploration of the impact that RSR methods have on inference for regression coefficients for the linear model. We then explore whether these results hold in the generalized linear model setting for count data using simulations. We show that the use of these methods has counterintuitive consequences that defy general expectations in the literature. In particular, our results and the accompanying simulations suggest that RSR methods will typically perform worse than nonspatial methods. These results have important implications for dimension reduction strategies in spatial regression modeling. Specifically, we demonstrate that the problems with RSR models cannot be fixed with a selection of “better” spatial basis vectors or dimension reduction techniques. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 482-494 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1788949 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1788949 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:482-494 Template-Type: ReDIF-Article 1.0 Author-Name: Tung-Yu Wu Author-X-Name-First: Tung-Yu Author-X-Name-Last: Wu Author-Name: Y. X. Rachel Wang Author-X-Name-First: Y. X. Author-X-Name-Last: Rachel Wang Author-Name: Wing H. Wong Author-X-Name-First: Wing H. Author-X-Name-Last: Wong Title: Mini-Batch Metropolis–Hastings With Reversible SGLD Proposal Abstract: Traditional Markov chain Monte Carlo (MCMC) algorithms are computationally intensive and do not scale well to large data. In particular, the Metropolis–Hastings (MH) algorithm requires passing over the entire dataset to evaluate the likelihood ratio in each iteration. We propose a general framework for performing MH-MCMC using mini-batches of the whole dataset and show that this gives rise to approximately a tempered stationary distribution.
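The core idea admits a compact illustration: replace the full-data log-likelihood ratio in the MH acceptance step with a rescaled mini-batch estimate. The Python sketch below is generic (it assumes a symmetric proposal and user-supplied callables, all hypothetical placeholders) and does not implement the article's reversible SGLD proposal.

    import numpy as np

    def minibatch_mh_step(theta, data, loglik_one, logprior, propose, m, rng):
        """One MH step whose accept/reject decision uses a random mini-batch of
        size m instead of the full dataset of size n; the n/m rescaling makes the
        mini-batch sum an unbiased estimate of the full log-likelihood ratio,
        which is why the chain targets the posterior only approximately."""
        n = len(data)
        cand = propose(theta, rng)
        batch = data[rng.choice(n, size=m, replace=False)]
        llr_hat = (n / m) * (loglik_one(cand, batch) - loglik_one(theta, batch)).sum()
        log_accept = llr_hat + logprior(cand) - logprior(theta)
        return cand if np.log(rng.uniform()) < log_accept else theta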
We prove that the algorithm preserves the modes of the original target distribution and derive an error bound on the approximation with mild assumptions on the likelihood. To further extend the utility of the algorithm to high-dimensional settings, we construct a proposal with forward and reverse moves using stochastic gradient and show that the construction leads to reasonable acceptance probabilities. We demonstrate the performance of our algorithm in both low-dimensional models and high-dimensional neural network applications. Particularly in the latter case, compared to popular optimization methods, our method is more robust to the choice of learning rate and improves testing accuracy. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 386-394 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1782222 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1782222 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:386-394 Template-Type: ReDIF-Article 1.0 Author-Name: Gareth M. James Author-X-Name-First: Gareth M. Author-X-Name-Last: James Author-Name: Peter Radchenko Author-X-Name-First: Peter Author-X-Name-Last: Radchenko Author-Name: Bradley Rava Author-X-Name-First: Bradley Author-X-Name-Last: Rava Title: Irrational Exuberance: Correcting Bias in Probability Estimates Abstract: We consider the common setting where one observes probability estimates for a large number of events, such as default risks for numerous bonds. Unfortunately, even with unbiased estimates, selecting events corresponding to the most extreme probabilities can result in systematically underestimating the true level of uncertainty. We develop an empirical Bayes approach “excess certainty adjusted probabilities” (ECAP), using a variant of Tweedie’s formula, which updates probability estimates to correct for selection bias. ECAP is a flexible nonparametric method, which directly estimates the score function associated with the probability estimates, so it does not need to make any restrictive assumptions about the prior on the true probabilities. ECAP also works well in settings where the probability estimates are biased. We demonstrate through theoretical results, simulations, and an analysis of two real-world datasets, that ECAP can provide significant improvements over the original probability estimates. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 455-468 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1787175 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1787175 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:455-468 Template-Type: ReDIF-Article 1.0 Author-Name: Daniel Peña Author-X-Name-First: Daniel Author-X-Name-Last: Peña Title: Comment on “Factor Models for High-Dimensional Tensor Time Series” Journal: Journal of the American Statistical Association Pages: 118-123 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2021.2024214 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2024214 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:118-123 Template-Type: ReDIF-Article 1.0 Author-Name: Oliver B.
Linton Author-X-Name-First: Oliver B. Author-X-Name-Last: Linton Author-Name: Haihan Tang Author-X-Name-First: Haihan Author-X-Name-Last: Tang Title: Comment on “Factor Models for High-Dimensional Tensor Time Series” by Rong Chen, Dan Yang, and Cun-Hui Zhang Journal: Journal of the American Statistical Association Pages: 117-117 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2021.2018328 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2018328 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:117-117 Template-Type: ReDIF-Article 1.0 Author-Name: Jianyu Liu Author-X-Name-First: Jianyu Author-X-Name-Last: Liu Author-Name: Haodong Wang Author-X-Name-First: Haodong Author-X-Name-Last: Wang Author-Name: Wei Sun Author-X-Name-First: Wei Author-X-Name-Last: Sun Author-Name: Yufeng Liu Author-X-Name-First: Yufeng Author-X-Name-Last: Liu Title: Prioritizing Autism Risk Genes Using Personalized Graphical Models Estimated From Single-Cell RNA-seq Data Abstract: Hundreds of autism risk genes have been reported recently, mainly based on genetic studies where these risk genes have more de novo mutations in autism subjects than in healthy controls. However, as a complex disease, autism is likely associated with more risk genes and many of them may not be identifiable through de novo mutations. We hypothesize that more autism risk genes can be identified through their connections with known autism risk genes in personalized gene–gene interaction graphs. We estimate such personalized graphs using single-cell RNA sequencing (scRNA-seq) while appropriately modeling the cell dependence and possible zero-inflation in the scRNA-seq data. The sample size, which is the number of cells per individual, ranges from 891 to 1241 in our case study using scRNA-seq data in autism subjects and controls. We consider 1500 genes in our analysis. Since the number of genes is larger than or comparable to the sample size, we perform penalized estimation. We score each gene’s relevance by applying a simple graph kernel smoothing method to each personalized graph. The molecular functions of the top-scored genes are related to autism. For example, the candidate gene RYR2, which encodes the protein ryanodine receptor 2, is involved in neurotransmission, a process that is impaired in ASD patients. While our method provides a systematic and unbiased approach to prioritizing autism risk genes, the relevance of these genes needs to be further validated in functional studies. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 38-51 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2021.1933495 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1933495 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:38-51 Template-Type: ReDIF-Article 1.0 Author-Name: Fan Bu Author-X-Name-First: Fan Author-X-Name-Last: Bu Author-Name: Allison E. Aiello Author-X-Name-First: Allison E.
Author-X-Name-Last: Aiello Author-Name: Jason Xu Author-X-Name-First: Jason Author-X-Name-Last: Xu Author-Name: Alexander Volfovsky Author-X-Name-First: Alexander Author-X-Name-Last: Volfovsky Title: Likelihood-Based Inference for Partially Observed Epidemics on Dynamic Networks Abstract: We propose a generative model and an inference scheme for epidemic processes on dynamic, adaptive contact networks. Network evolution is formulated as a link-Markovian process, which is then coupled to an individual-level stochastic susceptible-infectious-recovered model, to describe the interplay between the dynamics of the disease spread and the contact network underlying the epidemic. A Markov chain Monte Carlo framework is developed for likelihood-based inference from partial epidemic observations, with a novel data augmentation algorithm specifically designed to deal with missing individual recovery times under the dynamic network setting. Through a series of simulation experiments, we demonstrate the validity and flexibility of the model as well as the efficacy and efficiency of the data augmentation inference scheme. The model is also applied to a recent real-world dataset on influenza-like-illness transmission with high-resolution social contact tracking records. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 510-526 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1790376 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1790376 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:510-526 Template-Type: ReDIF-Article 1.0 Author-Name: Ben Dai Author-X-Name-First: Ben Author-X-Name-Last: Dai Author-Name: Xiaotong Shen Author-X-Name-First: Xiaotong Author-X-Name-Last: Shen Author-Name: Junhui Wang Author-X-Name-First: Junhui Author-X-Name-Last: Wang Title: Embedding Learning Abstract: Numerical embedding has become a standard technique for processing and analyzing unstructured data that cannot be expressed in a predefined fashion. It stores the main characteristics of the data by mapping them onto a numerical vector. An embedding is often unsupervised and constructed by transfer learning from large-scale unannotated data. Given an embedding, a downstream learning method, referred to as a two-stage method, is applicable to unstructured data. In this article, we introduce a novel framework of embedding learning to deliver a higher learning accuracy than the two-stage method while identifying an optimal learning-adaptive embedding. In particular, we propose a concept of U-minimal sufficient learning-adaptive embeddings, based on which we seek an optimal one to maximize the learning accuracy subject to an embedding constraint. Moreover, when specializing the general framework to classification, we derive a graph embedding classifier based on a hyperlink tensor representing multiple hypergraphs, directed or undirected, characterizing multi-way relations of unstructured data. Numerically, we design algorithms based on blockwise coordinate descent and projected gradient descent to implement linear and feed-forward neural network classifiers, respectively. Theoretically, we establish a learning theory to quantify the generalization error of the proposed method.
Moreover, we show, in linear regression, that the one-hot encoder is preferable among two-stage methods, yet its dimension restriction hinders its predictive performance. For a graph embedding classifier, the generalization error matches up to the standard fast rate or the parametric rate for linear or nonlinear classification. Finally, we demonstrate the utility of the classifiers on two benchmarks in grammatical classification and sentiment analysis. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 307-319 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1775614 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1775614 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:307-319 Template-Type: ReDIF-Article 1.0 Author-Name: Jialiang Li Author-X-Name-First: Jialiang Author-X-Name-Last: Li Author-Name: Jing Lv Author-X-Name-First: Jing Author-X-Name-Last: Lv Author-Name: Alan T. K. Wan Author-X-Name-First: Alan T. K. Author-X-Name-Last: Wan Author-Name: Jun Liao Author-X-Name-First: Jun Author-X-Name-Last: Liao Title: AdaBoost Semiparametric Model Averaging Prediction for Multiple Categories Abstract: Model averaging techniques are very useful for model-based prediction. However, most earlier works in this field focused on parametric models and continuous responses. In this article, we study varying coefficient multinomial logistic models and propose a semiparametric model averaging prediction (SMAP) approach for multi-category outcomes. The proposed procedure does not need any artificial specification of the index variable in the adopted varying coefficient sub-model structure to forecast the response. In particular, this new SMAP method is more flexible and robust against model misspecification. To improve the practical predictive performance, we combine SMAP with the AdaBoost algorithm to obtain more accurate estimations of class probabilities and model averaging weights. We compare our proposed methods with existing model averaging approaches and a wide range of popular classification methods via extensive simulations. An automobile classification study is included to illustrate the merits of our methodology. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 495-509 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1790375 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1790375 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:495-509 Template-Type: ReDIF-Article 1.0 Author-Name: Ian Laga Author-X-Name-First: Ian Author-X-Name-Last: Laga Author-Name: Xiaoyue Niu Author-X-Name-First: Xiaoyue Author-X-Name-Last: Niu Author-Name: Le Bao Author-X-Name-First: Le Author-X-Name-Last: Bao Title: Modeling the Marked Presence-Only Data: A Case Study of Estimating the Female Sex Worker Size in Malawi Abstract: Certain subpopulations like female sex workers (FSW), men who have sex with men (MSM), and people who inject drugs (PWID) often have higher prevalence of HIV/AIDS and are difficult to map directly due to stigma, discrimination, and criminalization. Fine-scale mapping of those populations contributes to the progress toward reducing the inequalities and ending the AIDS epidemic.
In 2016 and 2017, the PLACE surveys were conducted at 3290 venues in 20 of the 28 districts in Malawi to estimate the FSW sizes. These venues represent a presence-only dataset where, instead of knowing both where people live and do not live (presence–absence data), only information about visited locations is available. In this study, we develop a Bayesian model for presence-only data and utilize the PLACE data to estimate the FSW size and uncertainty interval at a 1.5×1.5-km resolution for all of Malawi. The estimates can also be aggregated to any desirable level (city/district/region) for implementing targeted HIV prevention and treatment programs in FSW communities, which have been successful in lowering the incidence of HIV and other sexually transmitted infections. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 27-37 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2021.1944873 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1944873 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:27-37 Template-Type: ReDIF-Article 1.0 Author-Name: Hongjian Shi Author-X-Name-First: Hongjian Author-X-Name-Last: Shi Author-Name: Mathias Drton Author-X-Name-First: Mathias Author-X-Name-Last: Drton Author-Name: Fang Han Author-X-Name-First: Fang Author-X-Name-Last: Han Title: Distribution-Free Consistent Independence Tests via Center-Outward Ranks and Signs Abstract: This article investigates the problem of testing independence of two random vectors of general dimensions. For this, we give for the first time a distribution-free consistent test. Our approach combines distance covariance with the center-outward ranks and signs developed by Marc Hallin and collaborators. In technical terms, the proposed test is consistent and distribution-free in the family of multivariate distributions with nonvanishing (Lebesgue) probability densities. Exploiting the (degenerate) U-statistic structure of the distance covariance and the combinatorial nature of Hallin’s center-outward ranks and signs, we are able to derive the limiting null distribution of our test statistic. The resulting asymptotic approximation is accurate already for moderate sample sizes and makes the test implementable without requiring permutation. The limiting distribution is derived via a more general result that gives a new type of combinatorial noncentral limit theorem for double- and multiple-indexed permutation statistics. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 395-410 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1782223 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1782223 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:395-410 Template-Type: ReDIF-Article 1.0 Author-Name: Jun Yu Author-X-Name-First: Jun Author-X-Name-Last: Yu Author-Name: HaiYing Wang Author-X-Name-First: HaiYing Author-X-Name-Last: Wang Author-Name: Mingyao Ai Author-X-Name-First: Mingyao Author-X-Name-Last: Ai Author-Name: Huiming Zhang Author-X-Name-First: Huiming Author-X-Name-Last: Zhang Title: Optimal Distributed Subsampling for Maximum Quasi-Likelihood Estimators With Massive Data Abstract: Nonuniform subsampling methods are effective in reducing computational burden and maintaining estimation efficiency for massive data. Existing methods mostly focus on subsampling with replacement due to its high computational efficiency. If the data volume is so large that nonuniform subsampling probabilities cannot be calculated all at once, then subsampling with replacement is infeasible to implement. This article solves this problem using Poisson subsampling. We first derive optimal Poisson subsampling probabilities in the context of quasi-likelihood estimation under the A- and L-optimality criteria. For a practically implementable algorithm with approximated optimal subsampling probabilities, we establish the consistency and asymptotic normality of the resultant estimators. To deal with the situation in which the full data are stored in different blocks or at multiple locations, we develop a distributed subsampling framework, in which statistics are computed simultaneously on smaller partitions of the full data. Asymptotic properties of the resultant aggregated estimator are investigated. We illustrate and evaluate the proposed strategies through numerical experiments on simulated and real datasets. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 265-276 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1773832 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1773832 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:265-276 Template-Type: ReDIF-Article 1.0 Author-Name: Youngjun Choe Author-X-Name-First: Youngjun Author-X-Name-Last: Choe Title: An Introduction to Acceptance Sampling and SPC with R Journal: Journal of the American Statistical Association Pages: 528-528 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2022.2035160 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2035160 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:528-528 Template-Type: ReDIF-Article 1.0 Author-Name: James Y. Dai Author-X-Name-First: James Y. Author-X-Name-Last: Dai Author-Name: Janet L. Stanford Author-X-Name-First: Janet L. Author-X-Name-Last: Stanford Author-Name: Michael LeBlanc Author-X-Name-First: Michael Author-X-Name-Last: LeBlanc Title: A Multiple-Testing Procedure for High-Dimensional Mediation Hypotheses Abstract: Mediation analysis is of rising interest in epidemiologic studies and clinical trials. Among existing methods, the joint significance test yields an overly conservative Type I error rate and low power, particularly for high-dimensional mediation hypotheses. In this article, we develop a multiple-testing procedure that accurately controls the family-wise error rate (FWER) and the false discovery rate (FDR) when testing high-dimensional mediation hypotheses.
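For reference, the classical joint significance (max-p) baseline that this abstract critiques is essentially a one-liner; the article's correction, which rests on estimating the proportions of component null hypotheses and the mixture null distribution, is not reproduced here.

    import numpy as np

    def joint_significance(p_alpha, p_beta):
        """Classical joint significance test for mediation: declare a mediation
        effect only if both component effects are significant, i.e., use the
        maximum of the two p-values. Referring this maximum to a uniform null
        ignores the composite-null mixture, which is what makes the naive test
        conservative and low-powered."""
        return np.maximum(np.asarray(p_alpha), np.asarray(p_beta))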
The core of our procedure is based on estimating the proportions of component null hypotheses and the underlying mixture null distribution of p-values. Theoretical developments and simulation experiments prove that the proposed procedure effectively controls FWER and FDR. Two mediation analyses on DNA methylation and cancer research are presented: assessing the mediation role of DNA methylation in genetic regulation of gene expression in primary prostate cancer samples; exploring the possibility of DNA methylation mediating the effect of exercise on prostate cancer progression. Results of data examples include well-behaved quantile-quantile plots and improved power to detect novel mediation relationships. An R package HDMT implementing the proposed procedure is freely accessible in CRAN. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 198-213 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1765785 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1765785 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:198-213 Template-Type: ReDIF-Article 1.0 Author-Name: Rong Chen Author-X-Name-First: Rong Author-X-Name-Last: Chen Author-Name: Dan Yang Author-X-Name-First: Dan Author-X-Name-Last: Yang Author-Name: Cun-Hui Zhang Author-X-Name-First: Cun-Hui Author-X-Name-Last: Zhang Title: Factor Models for High-Dimensional Tensor Time Series Abstract: Large tensor (multi-dimensional array) data routinely appear nowadays in a wide range of applications, due to modern data collection capabilities. Often such observations are taken over time, forming tensor time series. In this article we present a factor model approach to the analysis of high-dimensional dynamic tensor time series and multi-category dynamic transport networks. This article presents two estimation procedures along with their theoretical properties and simulation results. We present two applications to illustrate the model and its interpretations. Journal: Journal of the American Statistical Association Pages: 94-116 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2021.1912757 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1912757 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:94-116 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: From Fixed-X to Random-X Regression: Bias-Variance Decompositions, Covariance Penalties, and Prediction Error Estimation: Correction Journal: Journal of the American Statistical Association Pages: 529-529 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2021.2016420 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2016420 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:529-529 Template-Type: ReDIF-Article 1.0 Author-Name: Jun Yang Author-X-Name-First: Jun Author-X-Name-Last: Yang Author-Name: Zhou Zhou Author-X-Name-First: Zhou Author-X-Name-Last: Zhou Title: Spectral Inference under Complex Temporal Dynamics Abstract: We develop a unified theory and methodology for the inference of evolutionary Fourier power spectra for a general class of locally stationary and possibly nonlinear processes. 
In particular, simultaneous confidence regions (SCR) with asymptotically correct coverage rates are constructed for the evolutionary spectral densities on a nearly optimally dense grid of the joint time-frequency domain. A simulation-based bootstrap method is proposed to implement the SCR. The SCR enables researchers and practitioners to visually evaluate the magnitude and pattern of the evolutionary power spectra with asymptotically accurate statistical guarantee. The SCR also serves as a unified tool for a wide range of statistical inference problems in time-frequency analysis, ranging from tests for white noise, stationarity, and time-frequency separability to the validation of non-stationary linear models. Journal: Journal of the American Statistical Association Pages: 133-155 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1764365 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1764365 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:133-155 Template-Type: ReDIF-Article 1.0 Author-Name: Abolfazl Safikhani Author-X-Name-First: Abolfazl Author-X-Name-Last: Safikhani Author-Name: Ali Shojaie Author-X-Name-First: Ali Author-X-Name-Last: Shojaie Title: Joint Structural Break Detection and Parameter Estimation in High-Dimensional Nonstationary VAR Models Abstract: Assuming stationarity is unrealistic in many time series applications. A more realistic alternative is to assume piecewise stationarity, where the model can change at potentially many change points. We propose a three-stage procedure for simultaneous estimation of change points and parameters of high-dimensional piecewise vector autoregressive (VAR) models. In the first step, we reformulate the change point detection problem as a high-dimensional variable selection one, and solve it using a penalized least square estimator with a total variation penalty. We show that the penalized estimation method over-estimates the number of change points, and propose a selection criterion to identify the change points. In the last step of our procedure, we estimate the VAR parameters in each of the segments. We prove that the proposed procedure consistently detects the number and location of change points, and provides consistent estimates of VAR parameters. The performance of the method is illustrated through several simulated and real data examples. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 251-264 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1770097 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1770097 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:251-264 Template-Type: ReDIF-Article 1.0 Author-Name: Yaping Wang Author-X-Name-First: Yaping Author-X-Name-Last: Wang Author-Name: Fasheng Sun Author-X-Name-First: Fasheng Author-X-Name-Last: Sun Author-Name: Hongquan Xu Author-X-Name-First: Hongquan Author-X-Name-Last: Xu Title: On Design Orthogonality, Maximin Distance, and Projection Uniformity for Computer Experiments Abstract: Space-filling designs are widely used in both computer and physical experiments. Column-orthogonality, maximin distance, and projection uniformity are three basic and popular space-filling criteria proposed from different perspectives, but their relationships have rarely been investigated.
We show that the average squared correlation metric is a function of the pairwise L2-distances between the rows only. We further explore the connection between uniform projection designs and maximin L1-distance designs. Based on these connections, we develop new lower and upper bounds for column-orthogonality and projection uniformity from the perspective of distance between design points. These results not only provide new theoretical justifications for each criterion but also help in finding better space-filling designs under multiple criteria. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 375-385 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1782221 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1782221 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:375-385 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Correction to: Semiparametric Inference for Non-monotone Missing-Not-at-Random Data: the No Self-Censoring Model Journal: Journal of the American Statistical Association Pages: 530-530 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2021.2016421 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2016421 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:530-530 Template-Type: ReDIF-Article 1.0 Author-Name: Ray Bai Author-X-Name-First: Ray Author-X-Name-Last: Bai Author-Name: Gemma E. Moran Author-X-Name-First: Gemma E. Author-X-Name-Last: Moran Author-Name: Joseph L. Antonelli Author-X-Name-First: Joseph L. Author-X-Name-Last: Antonelli Author-Name: Yong Chen Author-X-Name-First: Yong Author-X-Name-Last: Chen Author-Name: Mary R. Boland Author-X-Name-First: Mary R. Author-X-Name-Last: Boland Title: Spike-and-Slab Group Lassos for Grouped Regression and Sparse Generalized Additive Models Abstract: We introduce the spike-and-slab group lasso (SSGL) for Bayesian estimation and variable selection in linear regression with grouped variables. We further extend the SSGL to sparse generalized additive models (GAMs), thereby introducing the first nonparametric variant of the spike-and-slab lasso methodology. Our model simultaneously performs group selection and estimation, while our fully Bayes treatment of the mixture proportion allows for model complexity control and automatic self-adaptivity to different levels of sparsity. We develop theory to uniquely characterize the global posterior mode under the SSGL and introduce a highly efficient block coordinate ascent algorithm for maximum a posteriori estimation. We further employ de-biasing methods to provide uncertainty quantification of our estimates. Thus, implementation of our model avoids the computational intensiveness of Markov chain Monte Carlo in high dimensions. We derive posterior concentration rates for both grouped linear regression and sparse GAMs when the number of covariates grows at nearly exponential rate with sample size. Finally, we illustrate our methodology through extensive simulations and data analysis. Supplementary materials for this article are available online.
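Block coordinate algorithms for group-lasso-type penalties like the one above are built around a group soft-thresholding operator; the Python sketch below shows that generic operator only, not the SSGL's adaptive spike-and-slab weighting of the penalty.

    import numpy as np

    def group_soft_threshold(beta_g, lam):
        """Proximal operator of the group-lasso penalty lam * ||beta_g||_2:
        the whole coefficient group is shrunk toward zero and set exactly to
        zero once its Euclidean norm falls below lam, which is what produces
        group-level selection."""
        norm = np.linalg.norm(beta_g)
        if norm <= lam:
            return np.zeros_like(beta_g)
        return (1.0 - lam / norm) * beta_g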
Journal: Journal of the American Statistical Association Pages: 184-197 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1765784 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1765784 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:184-197 Template-Type: ReDIF-Article 1.0 Author-Name: Derek Feng Author-X-Name-First: Derek Author-X-Name-Last: Feng Author-Name: Randolf Altmeyer Author-X-Name-First: Randolf Author-X-Name-Last: Altmeyer Author-Name: Derek Stafford Author-X-Name-First: Derek Author-X-Name-Last: Stafford Author-Name: Nicholas A. Christakis Author-X-Name-First: Nicholas A. Author-X-Name-Last: Christakis Author-Name: Harrison H. Zhou Author-X-Name-First: Harrison H. Author-X-Name-Last: Zhou Title: Testing for Balance in Social Networks Abstract: Friendship and antipathy exist in concert with one another in real social networks. Despite the role they play in social interactions, antagonistic ties are poorly understood and infrequently measured. One important theory of negative ties that has received relatively little empirical evaluation is balance theory, the codification of the adage “the enemy of my enemy is my friend” and similar sayings. Unbalanced triangles are those with an odd number of negative ties, and the theory posits that such triangles are rare. To test for balance, previous works have used a permutation test on the edge signs. The flaw in this method, however, is that it assumes that negative and positive edges are interchangeable. In reality, they could not be more different. Here, we propose a novel test of balance that accounts for this discrepancy and show that our test is more accurate at detecting balance. Along the way, we prove asymptotic normality of the test statistic under our null model, which is of independent interest. Our case study is a novel dataset of signed networks we collected from 32 isolated, rural villages in Honduras. Contrary to previous results, we find that there is only marginal evidence for balance in social tie formation in this setting. Journal: Journal of the American Statistical Association Pages: 156-174 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1764850 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1764850 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:156-174 Template-Type: ReDIF-Article 1.0 Author-Name: Jialin Ouyang Author-X-Name-First: Jialin Author-X-Name-Last: Ouyang Author-Name: Ming Yuan Author-X-Name-First: Ming Author-X-Name-Last: Yuan Title: Comments on “Factor Models for High-Dimensional Tensor Time Series” Journal: Journal of the American Statistical Association Pages: 124-127 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2022.2028630 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2028630 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:124-127 Template-Type: ReDIF-Article 1.0 Author-Name: Valérie Garès Author-X-Name-First: Valérie Author-X-Name-Last: Garès Author-Name: Jérémy Omer Author-X-Name-First: Jérémy Author-X-Name-Last: Omer Title: Regularized Optimal Transport of Covariates and Outcomes in Data Recoding Abstract: When databases are constructed from heterogeneous sources, it is not unusual that different encodings are used for the same outcome. 
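The sign-permutation baseline that the Feng et al. record above critiques is straightforward to state. The sketch below counts unbalanced triangles on a synthetic signed network and shuffles edge signs to form the permutation null; since the toy signs are themselves random, the p-value only demonstrates the mechanics, and the corrected test proposed in that record is not reproduced here.

import itertools
import numpy as np

def n_unbalanced(edge_sign):
    # a triangle is unbalanced if it contains an odd number of negative ties
    nodes = sorted({v for e in edge_sign for v in e})
    count = 0
    for tri in itertools.combinations(nodes, 3):
        pairs = [frozenset(p) for p in itertools.combinations(tri, 2)]
        if all(p in edge_sign for p in pairs):
            count += sum(edge_sign[p] < 0 for p in pairs) % 2
    return count

rng = np.random.default_rng(2)
edges = [frozenset((i, j)) for i in range(15) for j in range(i + 1, 15)
         if rng.random() < 0.4]
signs = rng.choice([-1, 1], size=len(edges), p=[0.3, 0.7])
observed = n_unbalanced(dict(zip(edges, signs)))

# permutation null: shuffle the signs over the fixed edge set
null = [n_unbalanced(dict(zip(edges, rng.permutation(signs)))) for _ in range(500)]
print("one-sided p-value:", np.mean([v <= observed for v in null]))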
In such cases, it is necessary to recode the outcome variable before merging two databases. The method proposed for the recoding is an application of optimal transportation, where we search for a bijective mapping between the distributions of such a variable in two databases. In this article, we build upon the work by Garés et al., where they transport the distributions of categorical outcomes assuming that they are distributed equally in the two databases. Here, we extend the scope of the model to treat all the situations where the covariates explain the outcomes similarly in the two databases. In particular, we do not require that the outcomes be distributed equally. For this, we propose a model where joint distributions of outcomes and covariates are transported. We also propose to enrich the model by relaxing the constraints on marginal distributions and adding an L1 regularization term. The performance of the models is evaluated in a simulation study, and they are applied to a real dataset. The code used in the computational assessment and in the simulation of test cases is publicly available in the GitHub repository: https://github.com/otrecoding/OTRecod.jl. Journal: Journal of the American Statistical Association Pages: 320-333 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1775615 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1775615 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:320-333 Template-Type: ReDIF-Article 1.0 Author-Name: Zhenhua Lin Author-X-Name-First: Zhenhua Author-X-Name-Last: Lin Author-Name: Jane-Ling Wang Author-X-Name-First: Jane-Ling Author-X-Name-Last: Wang Title: Mean and Covariance Estimation for Functional Snippets Abstract: We consider estimation of mean and covariance functions of functional snippets, which are short segments of functions possibly observed irregularly on an individual-specific subinterval that is much shorter than the entire study interval. Estimation of the covariance function for functional snippets is challenging since information for the far off-diagonal regions of the covariance structure is completely missing. We address this difficulty by decomposing the covariance function into a variance function component and a correlation function component. The variance function can be effectively estimated nonparametrically, while the correlation part is modeled parametrically, possibly with an increasing number of parameters, to handle the missing information in the far off-diagonal regions. Both theoretical analysis and numerical simulations suggest that this hybrid strategy is effective. In addition, we propose a new estimator for the variance of measurement errors and analyze its asymptotic properties. This estimator is required for the estimation of the variance function from noisy measurements. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 348-360 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1777138 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1777138 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:348-360 Template-Type: ReDIF-Article 1.0 Author-Name: Jinyuan Chang Author-X-Name-First: Jinyuan Author-X-Name-Last: Chang Author-Name: Eric D. Kolaczyk Author-X-Name-First: Eric D.
Author-X-Name-Last: Kolaczyk Author-Name: Qiwei Yao Author-X-Name-First: Qiwei Author-X-Name-Last: Yao Title: Estimation of Subgraph Densities in Noisy Networks Abstract: While it is common practice in applied network analysis to report various standard network summary statistics, these numbers are rarely accompanied by uncertainty quantification. Yet any error inherent in the measurements underlying the construction of the network, or in the network construction procedure itself, must necessarily propagate to any summary statistics reported. Here we study the problem of estimating the density of an arbitrary subgraph, given a noisy version of some underlying network as data. Under a simple model of network error, we show that consistent estimation of such densities is impossible when the rates of error are unknown and only a single network is observed. Accordingly, we develop method-of-moments estimators of network subgraph densities and error rates for the case where a minimal number of network replicates are available. These estimators are shown to be asymptotically normal as the number of vertices increases to infinity. We also provide confidence intervals for quantifying the uncertainty in these estimates based on the asymptotic normality. To construct the confidence intervals, a new and nonstandard bootstrap method is proposed to compute asymptotic variances that are otherwise infeasible to obtain. We illustrate the proposed methods in the context of gene coexpression networks. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 361-374 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1778482 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1778482 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:361-374 Template-Type: ReDIF-Article 1.0 Author-Name: Chris McKennan Author-X-Name-First: Chris Author-X-Name-Last: McKennan Author-Name: Dan Nicolae Author-X-Name-First: Dan Author-X-Name-Last: Nicolae Title: Estimating and Accounting for Unobserved Covariates in High-Dimensional Correlated Data Abstract: Many high-dimensional and high-throughput biological datasets have complex sample correlation structures, which include longitudinal and multiple tissue data, as well as data with multiple treatment conditions or related individuals. These data, as well as nearly all high-throughput “omic” data, are influenced by technical and biological factors unknown to the researcher, which, if unaccounted for, can severely obfuscate estimation of and inference on the effects of interest. We therefore developed CBCV and CorrConf: provably accurate and computationally efficient methods to choose the number of and estimate latent confounding factors present in high-dimensional data with correlated or nonexchangeable residuals. We demonstrate each method’s superior performance compared to other state-of-the-art methods by analyzing simulated multi-tissue gene expression data and identifying sex-associated DNA methylation sites in a real, longitudinal twin study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 225-236 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1769635 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1769635 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:225-236 Template-Type: ReDIF-Article 1.0 Author-Name: Mats J. Stensrud Author-X-Name-First: Mats J. Author-X-Name-Last: Stensrud Author-Name: Jessica G. Young Author-X-Name-First: Jessica G. Author-X-Name-Last: Young Author-Name: Vanessa Didelez Author-X-Name-First: Vanessa Author-X-Name-Last: Didelez Author-Name: James M. Robins Author-X-Name-First: James M. Author-X-Name-Last: Robins Author-Name: Miguel A. Hernán Author-X-Name-First: Miguel A. Author-X-Name-Last: Hernán Title: Separable Effects for Causal Inference in the Presence of Competing Events Abstract: In time-to-event settings, the presence of competing events complicates the definition of causal effects. Here we propose the new separable effects to study the causal effect of a treatment on an event of interest. The separable direct effect is the treatment effect on the event of interest not mediated by its effect on the competing event. The separable indirect effect is the treatment effect on the event of interest only through its effect on the competing event. Similar to Robins and Richardson’s extended graphical approach for mediation analysis, the separable effects can only be identified under the assumption that the treatment can be decomposed into two distinct components that exert their effects through distinct causal pathways. Unlike existing definitions of causal effects in the presence of competing events, our estimands do not require cross-world contrasts or hypothetical interventions to prevent death. As an illustration, we apply our approach to a randomized clinical trial on estrogen therapy in individuals with prostate cancer. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 175-183 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1765783 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1765783 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:175-183 Template-Type: ReDIF-Article 1.0 Author-Name: Bingxin Zhao Author-X-Name-First: Bingxin Author-X-Name-Last: Zhao Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Title: On Genetic Correlation Estimation With Summary Statistics From Genome-Wide Association Studies Abstract: The cross-trait polygenic risk score (PRS) method has gained popularity for assessing genetic correlation of complex traits using summary statistics from biobank-scale genome-wide association studies (GWAS). However, empirical evidence has shown a common bias phenomenon: highly significant cross-trait PRS can only account for a very small amount of genetic variance (R2 can be <1%) in independent testing GWAS. The aim of this paper is to investigate and address the bias phenomenon of cross-trait PRS in numerous GWAS applications. We show that the estimated genetic correlation can be asymptotically biased toward zero. A consistent cross-trait PRS estimator is then proposed to correct such asymptotic bias. In addition, we investigate whether or not SNP screening by GWAS p-values can lead to improved estimation and show the effect of overlapping samples among GWAS. We analyze GWAS summary statistics of reaction time and brain structural magnetic resonance imaging-based features measured in the Pediatric Imaging, Neurocognition, and Genetics study.
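The bias-toward-zero phenomenon just described can be reproduced with a generic errors-in-variables simulation: noise in the GWAS effect estimates attenuates the correlation between a PRS and the true genetic value. The reliability-based correction below is a standard attenuation adjustment shown for illustration only, not the consistent estimator proposed in this record; all sizes and noise levels are hypothetical.

import numpy as np

rng = np.random.default_rng(3)
p, n = 2000, 1000
b = rng.normal(0.0, 0.02, size=p)            # true per-SNP effects
se = 0.02                                     # sampling error of GWAS estimates
b_hat = b + rng.normal(0.0, se, size=p)       # noisy summary statistics
X = rng.binomial(2, 0.5, size=(n, p)).astype(float)  # independent genotypes
X -= X.mean(axis=0)

g = X @ b                                     # true genetic value in test sample
prs = X @ b_hat                               # PRS built from noisy effects
raw = np.corrcoef(g, prs)[0, 1]               # attenuated toward zero

# generic correction: divide by the square root of the estimates' "reliability"
reliability = np.var(b) / (np.var(b) + se**2)
print("raw:", round(raw, 3), "corrected:", round(raw / np.sqrt(reliability), 3))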
We find that the raw cross-trait PRS estimators heavily underestimate the genetic similarity between cognitive function and human brain structures (mean R2 = 1.32%), whereas the bias-corrected estimators uncover the moderate degree of genetic overlap between these closely related heritable traits (mean R2 = 22.42%). Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1-11 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2021.1906684 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1906684 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:1-11 Template-Type: ReDIF-Article 1.0 Author-Name: Albert Xingyi Man Author-X-Name-First: Albert Xingyi Author-X-Name-Last: Man Author-Name: Steven Andrew Culpepper Author-X-Name-First: Steven Andrew Author-X-Name-Last: Culpepper Title: A Mode-Jumping Algorithm for Bayesian Factor Analysis Abstract: Exploratory factor analysis is a dimension-reduction technique commonly used in psychology, finance, genomics, neuroscience, and economics. Advances in computational power have opened the door for fully Bayesian treatments of factor analysis. One open problem is enforcing rotational identifiability of the latent factor loadings, as the loadings are not identified from the likelihood without further restrictions. Nonidentifiability of the loadings can cause posterior multimodality, which can produce misleading posterior summaries. The positive-diagonal, lower-triangular (PLT) constraint is the most commonly used restriction to guarantee identifiability, in which the upper m × m submatrix of the loadings is constrained to be a lower-triangular matrix with positive-diagonal elements. The PLT constraint can fail to guarantee identifiability if the constrained submatrix is singular. Furthermore, though the PLT constraint addresses identifiability-related multimodality, it introduces additional mixing issues. We introduce a new Bayesian sampling algorithm that efficiently explores the multimodal posterior surface and addresses issues with PLT-constrained approaches. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 277-290 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1773833 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1773833 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:277-290 Template-Type: ReDIF-Article 1.0 Author-Name: Ying-Qi Zhao Author-X-Name-First: Ying-Qi Author-X-Name-Last: Zhao Title: Dynamic Treatment Regimes: Statistical Methods for Precision Medicine Journal: Journal of the American Statistical Association Pages: 527-527 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2022.2035159 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2035159 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:527-527 Template-Type: ReDIF-Article 1.0 Author-Name: Zhonghua Liu Author-X-Name-First: Zhonghua Author-X-Name-Last: Liu Author-Name: Jincheng Shen Author-X-Name-First: Jincheng Author-X-Name-Last: Shen Author-Name: Richard Barfield Author-X-Name-First: Richard Author-X-Name-Last: Barfield Author-Name: Joel Schwartz Author-X-Name-First: Joel Author-X-Name-Last: Schwartz Author-Name: Andrea A. Baccarelli Author-X-Name-First: Andrea A. Author-X-Name-Last: Baccarelli Author-Name: Xihong Lin Author-X-Name-First: Xihong Author-X-Name-Last: Lin Title: Large-Scale Hypothesis Testing for Causal Mediation Effects with Applications in Genome-wide Epigenetic Studies Abstract: In genome-wide epigenetic studies, it is of great scientific interest to assess whether the effect of an exposure on a clinical outcome is mediated through DNA methylations. However, statistical inference for causal mediation effects is challenged by the fact that one needs to test a large number of composite null hypotheses across the whole epigenome. Two popular tests, the Wald-type Sobel’s test and the joint significance test using the traditional null distribution, are underpowered and thus can miss important scientific discoveries. In this article, we show that the null distribution of Sobel’s test is not the standard normal distribution and the null distribution of the joint significance test is not uniform under the composite null of no mediation effect, especially in finite samples and under the singular point null case that the exposure has no effect on the mediator and the mediator has no effect on the outcome. Our results explain why these two tests are underpowered, and more importantly motivate us to develop a more powerful divide-aggregate composite-null test (DACT) for the composite null hypothesis of no mediation effect by leveraging epigenome-wide data. We adopted Efron’s empirical null framework for assessing the statistical significance of the DACT method. We showed analytically that the proposed DACT method had improved power and could control the Type I error rate well. Our extensive simulation studies showed that, in finite samples, the DACT method properly controlled the Type I error rate and outperformed Sobel’s test and the joint significance test for detecting mediation effects. We applied the DACT method to the U.S. Department of Veterans Affairs Normative Aging Study, an ongoing prospective cohort study which included men who were aged 21 to 80 years at entry. We identified multiple DNA methylation CpG sites that might mediate the effect of smoking on lung function with effect sizes ranging from –0.18 to –0.79 and false discovery rate controlled at the level 0.05, including the CpG sites in the genes AHRR and F2RL3. Our sensitivity analysis found small residual correlations (less than 0.01) of the error terms between the outcome and mediator regressions, suggesting that our results are robust to unmeasured confounding factors. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 67-81 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2021.1914634 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1914634 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:67-81 Template-Type: ReDIF-Article 1.0 Author-Name: Félix Camirand Lemyre Author-X-Name-First: Félix Author-X-Name-Last: Camirand Lemyre Author-Name: Raymond J. Carroll Author-X-Name-First: Raymond J. Author-X-Name-Last: Carroll Author-Name: Aurore Delaigle Author-X-Name-First: Aurore Author-X-Name-Last: Delaigle Title: Semiparametric Estimation of the Distribution of Episodically Consumed Foods Measured With Error Abstract: Dietary data collected from 24-hour dietary recalls are observed with significant measurement errors. In the nonparametric curve estimation literature, much of the effort has been devoted to designing methods that are consistent under contamination by noise, which have traditionally been applied for analyzing those data. However, some foods such as alcohol or fruits are consumed only episodically, and may not be consumed during the day when the 24-hour recall is administered. These so-called excess zeros make existing nonparametric estimators break down, and new techniques need to be developed for such data. We develop two new consistent semiparametric estimators of the distribution of such episodically consumed food data, making parametric assumptions only on some less important parts of the model. We establish their theoretical properties and illustrate the good performance of our fully data-driven method in simulated and real data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 469-481 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1787840 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1787840 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:469-481 Template-Type: ReDIF-Article 1.0 Author-Name: Masako Ikefuji Author-X-Name-First: Masako Author-X-Name-Last: Ikefuji Author-Name: Roger J. A. Laeven Author-X-Name-First: Roger J. A. Author-X-Name-Last: Laeven Author-Name: Jan R. Magnus Author-X-Name-First: Jan R. Author-X-Name-Last: Magnus Author-Name: Yuan Yue Author-X-Name-First: Yuan Author-X-Name-Last: Yue Title: Earthquake Risk Embedded in Property Prices: Evidence From Five Japanese Cities Abstract: We analyze the impact of short-run (90 days) and long-run (30 years) earthquake risk on real estate transaction prices in five Japanese cities (Tokyo, Osaka, Nagoya, Fukuoka, and Sapporo), using quarterly data over the period 2006–2015. We exploit a rich panel dataset (331,343 observations) with property characteristics, ward attractiveness information, macroeconomic variables, and long-run seismic hazard data, supplemented with short-run earthquake probabilities generated from a seismic excitation model using historical earthquake occurrences. We design a hedonic property price model that allows for subjective probability weighting, employ a multivariate error components structure, and develop associated maximum likelihood estimation and variance computation procedures. Our approach enables us to identify the total compensation for earthquake risk embedded in property prices, to decompose this into pieces stemming from short-run and long-run risk, and to distinguish between objective and subjectively weighted (“distorted”) earthquake probabilities.
We find that objective long-run earthquake probabilities have a statistically significant negative impact on property prices, whereas short-run earthquake probabilities become statistically significant only when we allow them to be distorted. The total compensation for earthquake risk amounts to an average –2.0% of log property prices, slightly more than the annual income of a middle-income Japanese household. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 82-93 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2021.1928512 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1928512 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:82-93 Template-Type: ReDIF-Article 1.0 Author-Name: Lilun Du Author-X-Name-First: Lilun Author-X-Name-Last: Du Author-Name: Inchi Hu Author-X-Name-First: Inchi Author-X-Name-Last: Hu Title: An Empirical Bayes Method for Chi-Squared Data Abstract: In a thought-provoking paper, Efron investigated the merits and limitations of an empirical Bayes method to correct selection bias based on Tweedie’s formula, first reported in the study by Robbins. The exceptional virtue of Tweedie’s formula for the normal distribution lies in its representation of selection bias as a simple function of the derivative of the log marginal likelihood. Since the marginal likelihood and its derivative can be estimated from the data directly without invoking prior information, bias correction can be carried out conveniently. We propose a Bayesian hierarchical model for chi-squared data such that the resulting Tweedie’s formula has the same virtue as that of the normal distribution. Because the family of noncentral chi-squared distributions, the common alternative distributions for chi-squared tests, does not constitute an exponential family, our results cannot be obtained by extending existing results. Furthermore, the corresponding Tweedie’s formula manifests new phenomena quite different from those of the normal distribution and suggests new ways of analyzing chi-squared data. Journal: Journal of the American Statistical Association Pages: 334-347 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1777137 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1777137 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:334-347 Template-Type: ReDIF-Article 1.0 Author-Name: Wanjun Liu Author-X-Name-First: Wanjun Author-X-Name-Last: Liu Author-Name: Yuan Ke Author-X-Name-First: Yuan Author-X-Name-Last: Ke Author-Name: Jingyuan Liu Author-X-Name-First: Jingyuan Author-X-Name-Last: Liu Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Title: Model-Free Feature Screening and FDR Control With Knockoff Features Abstract: This article proposes a model-free and data-adaptive feature screening method for ultrahigh-dimensional data. The proposed method is based on the projection correlation, which measures the dependence between two random vectors. This projection-correlation-based method does not require specifying a regression model, and applies to data in the presence of heavy tails and multivariate responses.
It enjoys both sure screening and rank consistency properties under weak assumptions. A two-step approach, with the help of knockoff features, is advocated to specify the threshold for feature screening such that the false discovery rate (FDR) is controlled under a prespecified level. The proposed two-step approach enjoys both sure screening and FDR control simultaneously if the prespecified FDR level is greater than or equal to 1/s, where s is the number of active features. The superior empirical performance of the proposed method is illustrated by simulation examples and real data applications. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 428-443 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1783274 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1783274 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:428-443 Template-Type: ReDIF-Article 1.0 Author-Name: Sooin Yun Author-X-Name-First: Sooin Author-X-Name-Last: Yun Author-Name: Xianyang Zhang Author-X-Name-First: Xianyang Author-X-Name-Last: Zhang Author-Name: Bo Li Author-X-Name-First: Bo Author-X-Name-Last: Li Title: Detection of Local Differences in Spatial Characteristics Between Two Spatiotemporal Random Fields Abstract: Comparing the spatial characteristics of spatiotemporal random fields is often in demand. However, the comparison can be challenging due to the high dimensionality of and dependence in the data. We develop a new multiple testing approach to detect local differences in the spatial characteristics of two spatiotemporal random fields by taking the spatial information into account. Our method adopts a two-component mixture model for location-wise p-values and then derives a new false discovery rate (FDR) control, called the mirror procedure, to determine the optimal rejection region. This procedure is robust to model misspecification and allows for weak dependency among hypotheses. To integrate the spatial heterogeneity, we model the mixture probability and also study the benefit, if any, of allowing the alternative distribution to be spatially varying. An EM algorithm is developed to estimate the mixture model and implement the FDR procedure. We study the FDR control and the power of our new approach both theoretically and numerically, and apply the approach to compare the mean and teleconnection pattern between two synthetic climate fields. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 291-306 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1775613 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1775613 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:291-306 Template-Type: ReDIF-Article 1.0 Author-Name: Jiahui Yu Author-X-Name-First: Jiahui Author-X-Name-Last: Yu Author-Name: Jian Shi Author-X-Name-First: Jian Author-X-Name-Last: Shi Author-Name: Anna Liu Author-X-Name-First: Anna Author-X-Name-Last: Liu Author-Name: Yuedong Wang Author-X-Name-First: Yuedong Author-X-Name-Last: Wang Title: Smoothing Spline Semiparametric Density Models Abstract: Density estimation plays a fundamental role in many areas of statistics and machine learning.
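The data-adaptive threshold used with knockoff statistics in the Liu–Ke–Liu–Li record above takes a simple generic form: select features whose statistic W_j exceeds the smallest t at which the estimated false discovery proportion #{W_j <= -t}/#{W_j >= t} drops below the target level. The sketch below applies this standard rule to toy sign-symmetric statistics; the projection-correlation screening statistic itself is not reproduced, and all sizes are hypothetical.

import numpy as np

def knockoff_threshold(W, q):
    # smallest t with #{W_j <= -t} / max(1, #{W_j >= t}) <= q
    for t in np.sort(np.abs(W[W != 0])):
        if np.sum(W <= -t) / max(1, np.sum(W >= t)) <= q:
            return t
    return np.inf

rng = np.random.default_rng(4)
# 20 active features with large positive W; the 480 null statistics are
# sign-symmetric, as a valid knockoff construction guarantees
W = np.concatenate([rng.normal(3.0, 1.0, 20), rng.normal(0.0, 1.0, 480)])
t = knockoff_threshold(W, q=0.2)
print("threshold:", round(t, 2), "selected:", np.flatnonzero(W >= t))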
Parametric, nonparametric, and semiparametric density estimation methods have been proposed in the literature. Semiparametric density models are flexible in incorporating domain knowledge and uncertainty regarding the shape of the density function. Existing literature on semiparametric density models is scattered and lacks a systematic framework. In this article, we consider a unified framework based on reproducing kernel Hilbert space for modeling, estimation, computation, and theory. We propose general semiparametric density models for both a single sample and multiple samples, which include many existing semiparametric density models as special cases. We develop penalized likelihood-based estimation methods and computational methods for different situations. We establish joint consistency and derive convergence rates of the proposed estimators for both finite-dimensional Euclidean parameters and an infinite-dimensional functional parameter. We validate our estimation methods empirically through simulations and an application. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 237-250 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1769636 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1769636 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:237-250 Template-Type: ReDIF-Article 1.0 Author-Name: David E. Allen Author-X-Name-First: David E. Author-X-Name-Last: Allen Author-Name: Michael McAleer Author-X-Name-First: Michael Author-X-Name-Last: McAleer Title: “Generalized Measures of Correlation for Asymmetry, Nonlinearity, and Beyond”: Some Antecedents on Causality Abstract: This note comments on the generalized measure of correlation (GMC) that was suggested by Zheng, Shi, and Zhang. The GMC concept was partly anticipated in some publications over 100 years earlier by Yule in the Proceedings of the Royal Society, and by Kendall. Other antecedents discussed include work on dependency by Renyi and by Doksum and Samarov, together with the Yule–Simpson paradox. The GMC metric partly extends the concept of Granger causality, so that we consider causality, graphical analysis, and alternative measures of dependency provided by copulas. Journal: Journal of the American Statistical Association Pages: 214-224 Issue: 537 Volume: 117 Year: 2022 Month: 1 X-DOI: 10.1080/01621459.2020.1768101 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1768101 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:214-224 Template-Type: ReDIF-Article 1.0 Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Ricardo Masini Author-X-Name-First: Ricardo Author-X-Name-Last: Masini Author-Name: Marcelo C. Medeiros Author-X-Name-First: Marcelo C. Author-X-Name-Last: Medeiros Title: Do We Exploit all Information for Counterfactual Analysis? Benefits of Factor Models and Idiosyncratic Correction Abstract: Optimal pricing, that is, determining the price level that maximizes profit or revenue of a given product, is a vital task for the retail industry. To select such a quantity, one needs first to estimate the price elasticity from the product demand. Regression methods usually fail to recover such elasticities due to confounding effects and price endogeneity.
Therefore, randomized experiments are typically required. However, elasticities can be highly heterogeneous depending on the location of stores, for example. As the randomization frequently occurs at the municipal level, standard difference-in-differences methods may also fail. Possible solutions are based on methodologies to measure the effects of treatments on a single (or just a few) treated unit(s) based on counterfactuals constructed from artificial controls. For example, for each city in the treatment group, a counterfactual may be constructed from the untreated locations. In this article, we apply a novel high-dimensional statistical method to measure the effects of price changes on daily sales from a major retailer in Brazil. The proposed methodology combines principal components (factors) and sparse regressions, resulting in a method called Factor-Adjusted Regularized Method for Treatment evaluation (FarmTreat). The data consist of daily sales and prices of five different products over more than 400 municipalities. The products considered belong to the sweet and candies category, and experiments have been conducted over the years 2016 and 2017. Our results confirm the hypothesis of a high degree of heterogeneity, yielding very different pricing strategies across distinct municipalities. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 574-590 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2021.2004895 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2004895 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:574-590 Template-Type: ReDIF-Article 1.0 Author-Name: Haojie Ren Author-X-Name-First: Haojie Author-X-Name-Last: Ren Author-Name: Changliang Zou Author-X-Name-First: Changliang Author-X-Name-Last: Zou Author-Name: Nan Chen Author-X-Name-First: Nan Author-X-Name-Last: Chen Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Title: Large-Scale Datastreams Surveillance via Pattern-Oriented-Sampling Abstract: Monitoring large-scale datastreams with limited resources has become increasingly important for real-time detection of abnormal activities in many applications. Despite the availability of large datasets, the challenges associated with designing an efficient change-detection procedure when clustering or spatial patterns exist are not yet well addressed. In this article, a design-adaptive testing procedure is developed for settings where only a limited number of streaming observations can be accessed at each time. We derive an optimal sampling strategy, the pattern-oriented-sampling, with which the proposed test possesses asymptotically and locally best power under alternatives. A sequential change-detection procedure is then proposed by integrating this test with a generalized likelihood ratio approach. Benefiting from dynamically estimating the optimal sampling design, the proposed procedure is able to improve the sensitivity in detecting clustered changes compared with existing procedures, which tend to lose detection effectiveness by ignoring the neighboring information of spatially structured data. Its advantages are demonstrated in numerical simulations and a real data example. Supplementary materials for this article are available online.
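A generic flavor of the resource-limited monitoring problem in the Ren–Zou–Chen–Li record above can be sketched with per-stream CUSUM statistics and a sampling budget that favors currently suspicious streams; the simple top-r rule, the allowance k, and the limit h below are illustrative placeholders, not the pattern-oriented-sampling design derived in that record.

import numpy as np

rng = np.random.default_rng(5)
n_streams, budget, k, h = 200, 20, 0.5, 8.0  # streams, samples/step, allowance, limit
cusum = np.zeros(n_streams)
for t in range(500):
    # spend most of the budget on the most suspicious streams, keeping a few
    # random draws so that quiet streams are still revisited occasionally
    suspect = np.argsort(cusum)[-(budget - 5):]
    explore = rng.choice(np.setdiff1d(np.arange(n_streams), suspect), 5, replace=False)
    obs = np.concatenate([suspect, explore])
    x = rng.normal(size=obs.size)
    if t >= 300:
        x[obs < 10] += 1.0                   # streams 0-9 shift upward at t = 300
    cusum[obs] = np.maximum(0.0, cusum[obs] + x - k)
    if cusum.max() > h:
        print("alarm at t =", t, "on stream", int(cusum.argmax()))
        break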
Journal: Journal of the American Statistical Association Pages: 794-808 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1819295 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1819295 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:794-808 Template-Type: ReDIF-Article 1.0 Author-Name: Jiaying Gu Author-X-Name-First: Jiaying Author-X-Name-Last: Gu Author-Name: Roger Koenker Author-X-Name-First: Roger Author-X-Name-Last: Koenker Title: Nonparametric Maximum Likelihood Methods for Binary Response Models With Random Coefficients Abstract: The venerable method of maximum likelihood has found numerous recent applications in nonparametric estimation of regression and shape constrained densities. For mixture models the nonparametric maximum likelihood estimator (NPMLE) of Kiefer and Wolfowitz plays a central role in recent developments of empirical Bayes methods. The NPMLE has also been proposed by Cosslett as an estimation method for single index linear models for binary response with random coefficients. However, computational difficulties have hindered its application. Combining recent developments in computational geometry and convex optimization, we develop a new approach to computation for such models that dramatically increases their computational tractability. Consistency of the method is established for an expanded profile likelihood formulation. The methods are evaluated in simulation experiments, compared to the deconvolution methods of Gautier and Kitamura and illustrated in an application to modal choice for journey-to-work data in the Washington DC area. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 732-751 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1802284 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1802284 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:732-751 Template-Type: ReDIF-Article 1.0 Author-Name: Yiwei Fan Author-X-Name-First: Yiwei Author-X-Name-Last: Fan Author-Name: Xiaoling Lu Author-X-Name-First: Xiaoling Author-X-Name-Last: Lu Author-Name: Yufeng Liu Author-X-Name-First: Yufeng Author-X-Name-Last: Liu Author-Name: Junlong Zhao Author-X-Name-First: Junlong Author-X-Name-Last: Zhao Title: Angle-Based Hierarchical Classification Using Exact Label Embedding Abstract: Hierarchical classification problems are commonly seen in practice. However, most existing methods do not fully use the hierarchical information among class labels. In this article, a novel label embedding approach is proposed, which keeps the hierarchy of labels exactly, and reduces the complexity of the hypothesis space significantly. Based on the newly proposed label embedding approach, a new angle-based classifier is developed for hierarchical classification. Moreover, to handle massive data, a new (weighted) linear loss is designed, which has a closed form solution and is computationally efficient. Theoretical properties of the new method are established and intensive numerical comparisons with other methods are conducted. Both simulations and applications in document categorization demonstrate the advantages of the proposed method. Supplementary materials for this article are available online. 
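The Kiefer–Wolfowitz NPMLE central to the Gu–Koenker record above can be approximated on a fixed grid with the classical EM fixed-point iteration for the mixing weights, shown below for a Gaussian location mixture; their convex-optimization approach is faster and more accurate, so this is only a sketch of the object being estimated, with all grid choices hypothetical.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(2, 1, 150)])  # data

grid = np.linspace(-4, 4, 81)                 # candidate support points
L = norm.pdf(x[:, None] - grid[None, :])      # likelihood matrix f(x_i - g_k)
w = np.full(grid.size, 1.0 / grid.size)       # mixing weights to estimate
for _ in range(1000):                         # EM fixed point: w_k <- w_k * mean_i L_ik / f_i
    w *= (L / (L @ w)[:, None]).mean(axis=0)
print("atoms with mass > 5%:", grid[w > 0.05])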
Journal: Journal of the American Statistical Association Pages: 704-717 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1801450 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1801450 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:704-717 Template-Type: ReDIF-Article 1.0 Author-Name: Rune Christiansen Author-X-Name-First: Rune Author-X-Name-Last: Christiansen Author-Name: Matthias Baumann Author-X-Name-First: Matthias Author-X-Name-Last: Baumann Author-Name: Tobias Kuemmerle Author-X-Name-First: Tobias Author-X-Name-Last: Kuemmerle Author-Name: Miguel D. Mahecha Author-X-Name-First: Miguel D. Author-X-Name-Last: Mahecha Author-Name: Jonas Peters Author-X-Name-First: Jonas Author-X-Name-Last: Peters Title: Toward Causal Inference for Spatio-Temporal Data: Conflict and Forest Loss in Colombia Abstract: How does armed conflict influence tropical forest loss? For Colombia, both enhancing and reducing effect estimates have been reported. However, a lack of causal methodology has prevented establishing clear causal links between these two variables. In this work, we propose a class of causal models for spatio-temporal stochastic processes, which allows us to formally define and quantify the causal effect of a vector of covariates X on a real-valued response Y. We introduce a procedure for estimating causal effects and a nonparametric hypothesis test for these effects being zero. Our application is based on geospatial information on conflict events and remote-sensing-based data on forest loss between 2000 and 2018 in Colombia. Across the entire country, we estimate the effect to be slightly negative (conflict reduces forest loss) but insignificant (P = 0.578), while at the provincial level, we find both positive effects (e.g., La Guajira, P = 0.047) and negative effects (e.g., Magdalena, P = 0.004). The proposed methods do not make strong distributional assumptions, and allow for arbitrarily many latent confounders, given that these confounders do not vary across time. Our theoretical findings are supported by simulations, and code is available online. Journal: Journal of the American Statistical Association Pages: 591-601 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2021.2013241 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2013241 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:591-601 Template-Type: ReDIF-Article 1.0 Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Yuan Liao Author-X-Name-First: Yuan Author-X-Name-Last: Liao Title: Learning Latent Factors From Diversified Projections and Its Applications to Over-Estimated and Weak Factors Abstract: Estimations and applications of factor models often rely on the crucial condition that the number of latent factors is consistently estimated, which in turn also requires that factors be relatively strong, data be stationary and weakly serially dependent, and the sample size be fairly large, although in practical applications, one or several of these conditions may fail. In these cases, it is difficult to analyze the eigenvectors of the data matrix. To address this issue, we propose simple estimators of the latent factors using cross-sectional projections of the panel data, by weighted averages with predetermined weights.
These weights are chosen to diversify away the idiosyncratic components, resulting in “diversified factors.” Because the projections are conducted cross-sectionally, they are robust to serial dependence, easy to analyze, and work even for a finite length of time series. We formally prove that this procedure is robust to over-estimating the number of factors, and illustrate it in several applications, including post-selection inference, big data forecasts, large covariance estimation, and factor specification tests. We also recommend several choices for the diversified weights. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 909-924 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1831927 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1831927 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:909-924 Template-Type: ReDIF-Article 1.0 Author-Name: Yan Dora Zhang Author-X-Name-First: Yan Dora Author-X-Name-Last: Zhang Author-Name: Brian P. Naughton Author-X-Name-First: Brian P. Author-X-Name-Last: Naughton Author-Name: Howard D. Bondell Author-X-Name-First: Howard D. Author-X-Name-Last: Bondell Author-Name: Brian J. Reich Author-X-Name-First: Brian J. Author-X-Name-Last: Reich Title: Bayesian Regression Using a Prior on the Model Fit: The R2-D2 Shrinkage Prior Abstract: Prior distributions for high-dimensional linear regression require specifying a joint distribution for the unobserved regression coefficients, which is inherently difficult. We instead propose a new class of shrinkage priors for linear regression via specifying a prior first on the model fit, in particular, the coefficient of determination, and then distributing through to the coefficients in a novel way. The proposed method compares favorably to previous approaches in terms of both concentration around the origin and tail behavior, which leads to improved performance both in posterior contraction and in empirical performance. The limiting behavior of the proposed prior is 1/x, both around the origin and in the tails. This behavior is optimal in the sense that it simultaneously lies on the boundary of being an improper prior both in the tails and around the origin. None of the existing shrinkage priors obtain this behavior in both regions simultaneously. We also demonstrate that our proposed prior leads to the same near-minimax posterior contraction rate as the spike-and-slab prior. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 862-874 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1825449 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1825449 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:862-874 Template-Type: ReDIF-Article 1.0 Author-Name: Gabriel Hassler Author-X-Name-First: Gabriel Author-X-Name-Last: Hassler Author-Name: Max R. Tolkoff Author-X-Name-First: Max R. Author-X-Name-Last: Tolkoff Author-Name: William L. Allen Author-X-Name-First: William L. Author-X-Name-Last: Allen Author-Name: Lam Si Tung Ho Author-X-Name-First: Lam Si Tung Author-X-Name-Last: Ho Author-Name: Philippe Lemey Author-X-Name-First: Philippe Author-X-Name-Last: Lemey Author-Name: Marc A. Suchard Author-X-Name-First: Marc A.
Author-X-Name-Last: Suchard Title: Inferring Phenotypic Trait Evolution on Large Trees With Many Incomplete Measurements Abstract: Comparative biologists are often interested in inferring covariation between multiple biological traits sampled across numerous related taxa. To properly study these relationships, we must control for the shared evolutionary history of the taxa to avoid spurious inference. An additional challenge arises as obtaining a full suite of measurements becomes increasingly difficult with increasing taxa. This generally necessitates data imputation or integration, and existing control techniques typically scale poorly as the number of taxa increases. We propose an inference technique that integrates out missing measurements analytically and scales linearly with the number of taxa by using a post-order traversal algorithm under a multivariate Brownian diffusion (MBD) model to characterize trait evolution. We further exploit this technique to extend the MBD model to account for sampling error or nonheritable residual variance. We test these methods by examining mammalian life history traits, prokaryotic genomic and phenotypic traits, and HIV infection traits. We find computational efficiency increases that top two orders of magnitude over current best practices. While we focus on the utility of this algorithm in phylogenetic comparative methods, our approach generalizes to solve long-standing challenges in computing the likelihood for matrix-normal and multivariate normal distributions with missing data at scale. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 678-692 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1799812 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1799812 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:678-692 Template-Type: ReDIF-Article 1.0 Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Yingying Fan Author-X-Name-First: Yingying Author-X-Name-Last: Fan Author-Name: Xiao Han Author-X-Name-First: Xiao Author-X-Name-Last: Han Author-Name: Jinchi Lv Author-X-Name-First: Jinchi Author-X-Name-Last: Lv Title: Asymptotic Theory of Eigenvectors for Random Matrices With Diverging Spikes Abstract: Characterizing the asymptotic distributions of eigenvectors for large random matrices poses important challenges yet can provide useful insights into a range of statistical applications. To this end, in this article we introduce a general framework of asymptotic theory of eigenvectors for large spiked random matrices with diverging spikes and heterogeneous variances, and establish the asymptotic properties of the spiked eigenvectors and eigenvalues for the scenario of the generalized Wigner matrix noise. Under some mild regularity conditions, we provide the asymptotic expansions for the spiked eigenvalues and show that they are asymptotically normal after some normalization. For the spiked eigenvectors, we establish asymptotic expansions for the general linear combination and further show that it is asymptotically normal after some normalization, where the weight vector can be arbitrary. We also provide a more general asymptotic theory for the spiked eigenvectors using the bilinear form. Simulation studies verify the validity of our new theoretical results.
Our family of models encompasses many popularly used ones such as the stochastic block models with or without overlapping communities for network analysis and the topic models for text analysis, and our general theory can be exploited for statistical inference in these large-scale applications. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 996-1009 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1840990 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1840990 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:996-1009 Template-Type: ReDIF-Article 1.0 Author-Name: Assaf Rabinowicz Author-X-Name-First: Assaf Author-X-Name-Last: Rabinowicz Author-Name: Saharon Rosset Author-X-Name-First: Saharon Author-X-Name-Last: Rosset Title: Cross-Validation for Correlated Data Abstract: K-fold cross-validation (CV) with squared error loss is widely used for evaluating predictive models, especially when strong distributional assumptions cannot be made. However, CV with squared error loss is not free from distributional assumptions, in particular in cases involving non-iid data. This article analyzes CV for correlated data. We present a criterion for the suitability of standard CV in the presence of correlations. When this criterion does not hold, we introduce a bias-corrected CV estimator, which we term CVc, that yields an unbiased estimate of prediction error in many settings where standard CV is invalid. We also demonstrate our results numerically, and find that introducing our correction substantially improves both model evaluation and model selection in simulations and real data studies. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 718-731 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1801451 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1801451 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:718-731 Template-Type: ReDIF-Article 1.0 Author-Name: D. Andrew Brown Author-X-Name-First: D. Andrew Author-X-Name-Last: Brown Author-Name: Christopher S. McMahan Author-X-Name-First: Christopher S. Author-X-Name-Last: McMahan Author-Name: Russell T. Shinohara Author-X-Name-First: Russell T. Author-X-Name-Last: Shinohara Author-Name: Kristin A. Linn Author-X-Name-First: Kristin A. Author-X-Name-Last: Linn Title: Bayesian Spatial Binary Regression for Label Fusion in Structural Neuroimaging Abstract: Alzheimer’s disease is a neurodegenerative condition that accelerates cognitive decline relative to normal aging. It is of critical scientific importance to gain a better understanding of early disease mechanisms in the brain to facilitate effective, targeted therapies. The volume of the hippocampus is often used in diagnosis and monitoring of the disease. Measuring this volume via neuroimaging is difficult since each hippocampus must either be manually identified or automatically delineated, a task referred to as segmentation. Automatic hippocampal segmentation often involves mapping a previously manually segmented image to a new brain image and propagating the labels to obtain an estimate of where each hippocampus is located in the new image.
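The phenomenon behind the Rabinowicz–Rosset correction above is easy to reproduce: when folds ignore cluster structure, a flexible predictor can exploit shared cluster effects, so standard CV looks optimistic relative to the error on truly new clusters. The simulation below illustrates the gap with a 1-nearest-neighbor predictor; it does not implement their CVc estimator, and all simulation settings are hypothetical.

import numpy as np

rng = np.random.default_rng(7)
n_clusters, m = 40, 10
cl = np.arange(n_clusters).repeat(m)
x = rng.normal(size=n_clusters)[cl] + 0.1 * rng.normal(size=n_clusters * m)
y = 1.5 * x + rng.normal(size=n_clusters)[cl] + 0.5 * rng.normal(size=n_clusters * m)

def cv_mse_1nn(folds):
    # 1-nearest-neighbor prediction on x, scored fold by fold
    errs = []
    for f in np.unique(folds):
        tr, te = folds != f, folds == f
        nn = np.abs(x[te][:, None] - x[tr][None, :]).argmin(axis=1)
        errs.append(np.mean((y[te] - y[tr][nn]) ** 2))
    return np.mean(errs)

print("random folds (optimistic):", round(cv_mse_1nn(rng.integers(0, 5, x.size)), 2))
print("leave-cluster-out folds:  ", round(cv_mse_1nn(cl % 5), 2))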
A more recent approach to this problem is to propagate labels from multiple manually segmented atlases and combine the results using a process known as label fusion. To date, most label fusion algorithms employ voting procedures with voting weights assigned directly or estimated via optimization. We propose using a fully Bayesian spatial regression model for label fusion that facilitates direct incorporation of covariate information while making accessible the entire posterior distribution. Our results suggest that incorporating tissue classification (e.g., gray matter) into the label fusion procedure can greatly improve segmentation when relatively homogeneous, healthy brains are used as atlases for diseased brains. The fully Bayesian approach also produces meaningful uncertainty measures about hippocampal volumes, information which can be leveraged to detect significant, scientifically meaningful differences between healthy and diseased populations, improving the potential for early detection and tracking of the disease. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 547-560 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2021.2014854 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2014854 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:547-560 Template-Type: ReDIF-Article 1.0 Author-Name: Yaowu Liu Author-X-Name-First: Yaowu Author-X-Name-Last: Liu Author-Name: Zilin Li Author-X-Name-First: Zilin Author-X-Name-Last: Li Author-Name: Xihong Lin Author-X-Name-First: Xihong Author-X-Name-Last: Lin Title: A Minimax Optimal Ridge-Type Set Test for Global Hypothesis With Applications in Whole Genome Sequencing Association Studies Abstract: Testing a global hypothesis for a set of variables is a fundamental problem in statistics with a wide range of applications. A few well-known classical tests include Hotelling’s T2 test, the F-test, and the empirical Bayes based score test. These classical tests, however, are not robust to the signal strength and could have a substantial loss of power when signals are weak or moderate, a situation we commonly encounter in contemporary applications. In this article, we propose a minimax optimal ridge-type set test (MORST), a simple and generic method for testing a global hypothesis. The power of MORST is robust and considerably higher than that of the classical tests when the strength of signals is weak or moderate. At the same time, MORST requires only a slight increase in computation compared to these existing tests, making it applicable to the analysis of massive genome-wide data. We also provide generalizations of MORST that are parallel to the traditional Wald test and Rao’s score test in asymptotic settings. Extensive simulations demonstrated the robust power of MORST and that the Type I error of MORST was well controlled. We applied MORST to the analysis of the whole-genome sequencing data from the Atherosclerosis Risk in Communities study, where MORST detected 20%–250% more signal regions than the classical tests. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 897-908 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1831926 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1831926 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:897-908 Template-Type: ReDIF-Article 1.0 Author-Name: Muxuan Liang Author-X-Name-First: Muxuan Author-X-Name-Last: Liang Author-Name: Menggang Yu Author-X-Name-First: Menggang Author-X-Name-Last: Yu Title: A Semiparametric Approach to Model Effect Modification Abstract: One fundamental statistical question for research areas such as precision medicine and health disparities concerns discovering effect modification of treatment or exposure by observed covariates. We propose a semiparametric framework for identifying such effect modification. Instead of using the traditional outcome models, we directly posit semiparametric models on contrasts, or expected differences of the outcome under different treatment choices or exposures. Through semiparametric estimation theory, all valid estimating equations, including the efficient scores, are derived. Besides doubly robust loss functions, our approach also enables dimension reduction in the presence of many covariates. The asymptotic and non-asymptotic properties of the proposed methods are explored via a unified statistical and algorithmic analysis. Comparison with existing methods in both simulation and real data analysis demonstrates the superiority of our estimators, especially for an efficiency-improved version. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 752-764 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1811099 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1811099 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:752-764 Template-Type: ReDIF-Article 1.0 Author-Name: Jiayi Wang Author-X-Name-First: Jiayi Author-X-Name-Last: Wang Author-Name: Raymond K. W. Wong Author-X-Name-First: Raymond K. W. Author-X-Name-Last: Wong Author-Name: Xiaoke Zhang Author-X-Name-First: Xiaoke Author-X-Name-Last: Zhang Title: Low-Rank Covariance Function Estimation for Multidimensional Functional Data Abstract: Multidimensional functional data now arise in many fields. The covariance function plays an important role in the analysis of such increasingly common data. In this article, we propose a novel nonparametric covariance function estimation approach under the framework of reproducing kernel Hilbert spaces (RKHS) that can handle both sparse and dense functional data. We extend multilinear rank structures for (finite-dimensional) tensors to functions, which allow for flexible modeling of both covariance operators and marginal structures. The proposed framework can guarantee that the resulting estimator is automatically positive semidefinite, and can incorporate various spectral regularizations. The trace-norm regularization in particular can promote low ranks for both the covariance operator and the marginal structures.
Despite the lack of a closed form, under mild assumptions, the proposed estimator can achieve unified theoretical results that hold for any relative magnitude between the sample size and the number of observations per sampled field, and the rate of convergence reveals the phase-transition phenomenon from sparse to dense functional data. Based on a new representer theorem, an ADMM algorithm is developed for the trace-norm regularization. The appealing numerical performance of the proposed estimator is demonstrated by a simulation study and the analysis of a dataset from the Argo project. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 809-822 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1820344 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1820344 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:809-822 Template-Type: ReDIF-Article 1.0 Author-Name: Alberto Abadie Author-X-Name-First: Alberto Author-X-Name-Last: Abadie Author-Name: Jann Spiess Author-X-Name-First: Jann Author-X-Name-Last: Spiess Title: Robust Post-Matching Inference Abstract: Nearest-neighbor matching is a popular nonparametric tool to create balance between treatment and control groups in observational studies. As a preprocessing step before regression, matching reduces the dependence on parametric modeling assumptions. In current empirical practice, however, the matching step is often ignored in the calculation of standard errors and confidence intervals. In this article, we show that ignoring the matching step results in asymptotically valid standard errors if matching is done without replacement and the regression model is correctly specified relative to the population regression function of the outcome variable on the treatment variable and all the covariates used for matching. However, standard errors that ignore the matching step are not valid if matching is conducted with replacement or, more crucially, if the second-step regression model is misspecified in the sense indicated above. Moreover, correct specification of the regression model is not required for consistent estimation of treatment effects with matched data. We show that two easily implementable alternatives produce approximations to the distribution of the post-matching estimator that are robust to misspecification. A simulation study and an empirical example demonstrate the empirical relevance of our results. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 983-995 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1840383 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1840383 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:983-995 Template-Type: ReDIF-Article 1.0 Author-Name: Yang Liu Author-X-Name-First: Yang Author-X-Name-Last: Liu Author-Name: Feifang Hu Author-X-Name-First: Feifang Author-X-Name-Last: Hu Title: Balancing Unobserved Covariates With Covariate-Adaptive Randomized Experiments Abstract: Balancing important covariates is often critical in clinical trials and causal inference. Stratified permuted block (STR-PB) and covariate-adaptive randomization (CAR) procedures are widely used to balance observed covariates in practice.
The balance properties of these procedures with respect to the observed covariates have been well studied. However, it has been questioned whether these methods will also yield a good balance for the unobserved covariates. In this article, we develop a general framework for the analysis of unobserved covariate imbalance. These results can be used to derive and compare the balance properties of complete randomization (CR), STR-PB, and CAR procedures with respect to the unobserved covariates. To quantify the improvement obtained by using STR-PB and CAR procedures rather than CR, we introduce the percentage reduction in the variance of the unobserved covariate imbalance and compare these quantities. Our results demonstrate the benefits of using CAR or STR-PB (when the number of strata is small relative to the sample size) in terms of balancing unobserved covariates. These results also pave the way for future research into the effect of unobserved covariates in covariate-adaptive randomized experiments in clinical trials, as well as many other applications. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 875-886 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1825450 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1825450 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:875-886 Template-Type: ReDIF-Article 1.0 Author-Name: Changliang Zou Author-X-Name-First: Changliang Author-X-Name-Last: Zou Author-Name: Yuan Ke Author-X-Name-First: Yuan Author-X-Name-Last: Ke Author-Name: Wenyang Zhang Author-X-Name-First: Wenyang Author-X-Name-Last: Zhang Title: Estimation of Low Rank High-Dimensional Multivariate Linear Models for Multi-Response Data Abstract: In this article, we study low rank high-dimensional multivariate linear models (LRMLM) for high-dimensional multi-response data. We propose an intuitively appealing estimation approach and develop an algorithm for implementation purposes. Asymptotic properties are established to justify the estimation procedure theoretically. Intensive simulation studies are also conducted to demonstrate finite-sample performance, and a comparison is made with some popular methods from the literature. The results show that the proposed estimator outperforms all of the alternative methods under various circumstances. Finally, using our suggested estimation procedure, we apply the LRMLM to analyze an environmental dataset and predict concentrations of PM2.5 at the locations concerned. The results illustrate how the proposed method provides more accurate predictions than the alternative approaches. Journal: Journal of the American Statistical Association Pages: 693-703 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1799813 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1799813 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:693-703 Template-Type: ReDIF-Article 1.0 Author-Name: Yunxiao Li Author-X-Name-First: Yunxiao Author-X-Name-Last: Li Author-Name: Yi-Juan Hu Author-X-Name-First: Yi-Juan Author-X-Name-Last: Hu Author-Name: Glen A.
Author-X-Name-Last: Satten Title: A Bottom-Up Approach to Testing Hypotheses That Have a Branching Tree Dependence Structure, With Error Rate Control Abstract: Modern statistical analyses often involve testing large numbers of hypotheses. In many situations, these hypotheses may have an underlying tree structure that both helps determine the order in which tests should be conducted and imposes a dependency between tests that must be accounted for. Our motivating example comes from testing the association between a trait of interest and groups of microbes that have been organized into operational taxonomic units (OTUs) or amplicon sequence variants (ASVs). Given p-values from association tests for each individual OTU or ASV, we would like to know if we can declare a certain species, genus, or higher taxonomic group to be associated with the trait. For this problem, a bottom-up testing algorithm that starts at the lowest level of the tree (OTUs or ASVs) and proceeds upward through successively higher taxonomic groupings (species, genus, family, etc.) is required. We develop such a bottom-up testing algorithm that controls a novel error rate that we call the false selection rate. By simulation, we also show that our approach is better at finding driver taxa, the highest-level taxa below which there are dense association signals. We illustrate our approach using data from a study of the microbiome among patients with ulcerative colitis and healthy controls. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 664-677 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1799811 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1799811 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:664-677 Template-Type: ReDIF-Article 1.0 Author-Name: Chenguang Dai Author-X-Name-First: Chenguang Author-X-Name-Last: Dai Author-Name: Jun S. Liu Author-X-Name-First: Jun S. Author-X-Name-Last: Liu Title: Monte Carlo Approximation of Bayes Factors via Mixing With Surrogate Distributions Abstract: By mixing the target posterior distribution with a surrogate distribution whose normalizing constant is tractable, we propose a method for estimating the marginal likelihood using the Wang–Landau algorithm. We show that faster convergence of the proposed method can be achieved via momentum acceleration. Two implementation strategies are detailed: (i) facilitating global jumps between the posterior and surrogate distributions via the multiple-try Metropolis (MTM); (ii) constructing the surrogate via the variational approximation. When a surrogate is difficult to come by, we describe a new jumping mechanism for general reversible jump Markov chain Monte Carlo algorithms, which combines the MTM and a directional sampling algorithm. We illustrate the proposed methods on several statistical models, including the log-Gaussian Cox process, the Bayesian lasso, logistic regression, and g-prior Bayesian variable selection. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 765-780 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1811100 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1811100 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:765-780 Template-Type: ReDIF-Article 1.0 Author-Name: Zeya Wang Author-X-Name-First: Zeya Author-X-Name-Last: Wang Author-Name: Veerabhadran Baladandayuthapani Author-X-Name-First: Veerabhadran Author-X-Name-Last: Baladandayuthapani Author-Name: Ahmed O. Kaseb Author-X-Name-First: Ahmed O. Author-X-Name-Last: Kaseb Author-Name: Hesham M. Amin Author-X-Name-First: Hesham M. Author-X-Name-Last: Amin Author-Name: Manal M. Hassan Author-X-Name-First: Manal M. Author-X-Name-Last: Hassan Author-Name: Wenyi Wang Author-X-Name-First: Wenyi Author-X-Name-Last: Wang Author-Name: Jeffrey S. Morris Author-X-Name-First: Jeffrey S. Author-X-Name-Last: Morris Title: Bayesian Edge Regression in Undirected Graphical Models to Characterize Interpatient Heterogeneity in Cancer Abstract: It is well established that interpatient heterogeneity in cancer may significantly affect genomic data analyses and, in particular, network topologies. Most existing graphical model methods estimate a single population-level graph for a genomic or proteomic network. In many investigations, these networks depend on patient-specific indicators that characterize the heterogeneity of individual networks across subjects with respect to subject-level covariates. Examples include assessments of how the network varies with patient-specific prognostic scores or comparisons of tumor and normal graphs while accounting for tumor purity as a continuous predictor. In this article, we propose a novel edge regression model for undirected graphs, which estimates conditional dependencies as a function of subject-level covariates. We evaluate our model performance through simulation studies focused on comparing tumor and normal graphs while adjusting for tumor purity. In application to a dataset of proteomic measurements on plasma samples from patients with hepatocellular carcinoma (HCC), we ascertain how blood protein networks vary with disease severity, as measured by HepatoScore, a novel biomarker signature. Our case study shows that network connectivity increases with HepatoScore, and it identifies a set of hub proteins as well as important protein connections at different HepatoScore levels, which may provide important biological insights for the development of precision therapies for HCC. Journal: Journal of the American Statistical Association Pages: 533-546 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2021.2000866 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2000866 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:533-546 Template-Type: ReDIF-Article 1.0 Author-Name: Tianxi Li Author-X-Name-First: Tianxi Author-X-Name-Last: Li Author-Name: Lihua Lei Author-X-Name-First: Lihua Author-X-Name-Last: Lei Author-Name: Sharmodeep Bhattacharyya Author-X-Name-First: Sharmodeep Author-X-Name-Last: Bhattacharyya Author-Name: Koen Van den Berge Author-X-Name-First: Koen Author-X-Name-Last: Van den Berge Author-Name: Purnamrita Sarkar Author-X-Name-First: Purnamrita Author-X-Name-Last: Sarkar Author-Name: Peter J. Bickel Author-X-Name-First: Peter J.
Author-X-Name-Last: Bickel Author-Name: Elizaveta Levina Author-X-Name-First: Elizaveta Author-X-Name-Last: Levina Title: Hierarchical Community Detection by Recursive Partitioning Abstract: The problem of community detection in networks is usually formulated as finding a single partition of the network into some “correct” number of communities. We argue that it is more interpretable and in some regimes more accurate to construct a hierarchical tree of communities instead. This can be done with a simple top-down recursive partitioning algorithm, starting with a single community and repeatedly separating the nodes into two communities by spectral clustering, until a stopping rule suggests there are no further communities. This class of algorithms is model-free, computationally efficient, and requires no tuning other than selecting a stopping rule. We show that there are regimes where this approach outperforms K-way spectral clustering, and propose a natural framework for analyzing the algorithm’s theoretical performance, the binary tree stochastic block model. Under this model, we prove that the algorithm correctly recovers the entire community tree under relatively mild assumptions. We apply the algorithm to a gene network based on gene co-occurrence in 1580 research papers on anemia, and identify six clusters of genes in a meaningful hierarchy. We also illustrate the algorithm on a dataset of statistics papers. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 951-968 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1833888 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1833888 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:951-968 Template-Type: ReDIF-Article 1.0 Author-Name: Lazhi Wang Author-X-Name-First: Lazhi Author-X-Name-Last: Wang Author-Name: David E. Jones Author-X-Name-First: David E. Author-X-Name-Last: Jones Author-Name: Xiao-Li Meng Author-X-Name-First: Xiao-Li Author-X-Name-Last: Meng Title: Warp Bridge Sampling: The Next Generation Abstract: Bridge sampling is an effective Monte Carlo (MC) method for estimating the ratio of normalizing constants of two probability densities, a routine computational problem in statistics, physics, chemistry, and other fields. The MC error of the bridge sampling estimator is determined by the amount of overlap between the two densities. In the case of unimodal densities, Warp-I, II, and III transformations are effective for increasing the initial overlap, but they are less so for multimodal densities. This article introduces Warp-U transformations that aim to transform multimodal densities into unimodal ones (hence “U”) without altering their normalizing constants. The construction of a Warp-U transformation starts with a normal (or other convenient) mixture distribution ϕmix that has reasonable overlap with the target density p, whose normalizing constant is unknown. The stochastic transformation that maps ϕmix back to its generating distribution N(0,1) is then applied to p, yielding its Warp-U version, which we denote p̃. Typically, p̃ is unimodal and has substantially increased overlap with N(0,1). Furthermore, we prove that the overlap between p̃ and N(0,1) is guaranteed to be no less than the overlap between p and ϕmix, in terms of any f-divergence.
We propose a computationally efficient method to find an appropriate ϕmix, and a simple but effective approach to remove the bias that results from estimating the normalizing constant and fitting ϕmix with the same data. We illustrate our findings using 10- and 50-dimensional highly irregular multimodal densities, and demonstrate how Warp-U sampling can be used to improve the final estimation step of the Generalized Wang–Landau algorithm, a powerful sampling and estimation approach. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 835-851 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1825447 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1825447 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:835-851 Template-Type: ReDIF-Article 1.0 Author-Name: Lars Arne Jordanger Author-X-Name-First: Lars Arne Author-X-Name-Last: Jordanger Author-Name: Dag Tjøstheim Author-X-Name-First: Dag Author-X-Name-Last: Tjøstheim Title: Nonlinear Spectral Analysis: A Local Gaussian Approach Abstract: The spectral distribution f(ω) of a stationary time series {Y_t}_{t∈Z} can be used to investigate whether or not periodic structures are present in {Y_t}_{t∈Z}, but f(ω) has some limitations due to its dependence on the autocovariances γ(h). For example, f(ω) cannot distinguish white iid noise from GARCH-type models (whose terms are dependent, but uncorrelated), which implies that f(ω) can be an inadequate tool when {Y_t}_{t∈Z} contains asymmetries and nonlinear dependencies. Asymmetries between the upper and lower tails of a time series can be investigated by means of the local Gaussian autocorrelations, and these local measures of dependence can be used to construct the local Gaussian spectral density presented in this paper. A key feature of the new local spectral density is that it coincides with f(ω) for Gaussian time series, which implies that it can be used to detect non-Gaussian traits in the time series under investigation. In particular, if f(ω) is flat, then peaks and troughs of the new local spectral density can indicate nonlinear traits, which potentially might discover local periodic phenomena that remain undetected in an ordinary spectral analysis. Journal: Journal of the American Statistical Association Pages: 1010-1027 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1840991 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1840991 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:1010-1027 Template-Type: ReDIF-Article 1.0 Author-Name: Jeremiah Zhe Liu Author-X-Name-First: Jeremiah Zhe Author-X-Name-Last: Liu Author-Name: Wenying Deng Author-X-Name-First: Wenying Author-X-Name-Last: Deng Author-Name: Jane Lee Author-X-Name-First: Jane Author-X-Name-Last: Lee Author-Name: Pi-i Debby Lin Author-X-Name-First: Pi-i Debby Author-X-Name-Last: Lin Author-Name: Linda Valeri Author-X-Name-First: Linda Author-X-Name-Last: Valeri Author-Name: David C. Christiani Author-X-Name-First: David C. Author-X-Name-Last: Christiani Author-Name: David C. Bellinger Author-X-Name-First: David C. Author-X-Name-Last: Bellinger Author-Name: Robert O. Wright Author-X-Name-First: Robert O. Author-X-Name-Last: Wright Author-Name: Maitreyi M. Mazumdar Author-X-Name-First: Maitreyi M.
Author-X-Name-Last: Mazumdar Author-Name: Brent A. Coull Author-X-Name-First: Brent A. Author-X-Name-Last: Coull Title: A Cross-Validated Ensemble Approach to Robust Hypothesis Testing of Continuous Nonlinear Interactions: Application to Nutrition-Environment Studies Abstract: Gene-environment and nutrition-environment studies often involve testing of high-dimensional interactions between two sets of variables, each having potentially complex nonlinear main effects on an outcome. Construction of a valid and powerful hypothesis test for such an interaction is challenging, due to the difficulty in constructing an efficient and unbiased estimator for the complex, nonlinear main effects. In this work, we address this problem by proposing a cross-validated ensemble of kernels (CVEK) that learns the space of appropriate functions for the main effects using a cross-validated ensemble approach. With a carefully chosen library of base kernels, CVEK flexibly estimates the form of the main-effect functions from the data, and encourages test power by guarding against overfitting under the alternative. The method is motivated by a study on the interaction between metal exposures in utero and maternal nutrition on children’s neurodevelopment in rural Bangladesh. The proposed tests identified evidence of an interaction between mineral and vitamin intake and arsenic and manganese exposures. Results suggest that the detrimental effects of these metals are most pronounced at low intake levels of the nutrients, indicating that nutritional interventions in pregnant women could mitigate the adverse impacts of in utero metal exposures on the children’s neurodevelopment. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 561-573 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2021.1962889 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1962889 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:561-573 Template-Type: ReDIF-Article 1.0 Author-Name: Hejian Sang Author-X-Name-First: Hejian Author-X-Name-Last: Sang Author-Name: Jae Kwang Kim Author-X-Name-First: Jae Kwang Author-X-Name-Last: Kim Author-Name: Danhyang Lee Author-X-Name-First: Danhyang Author-X-Name-Last: Lee Title: Semiparametric Fractional Imputation Using Gaussian Mixture Models for Handling Multivariate Missing Data Abstract: Item nonresponse is frequently encountered in practice. Ignoring missing data can reduce efficiency and lead to misleading inference. Fractional imputation is a frequentist imputation approach for handling missing data. However, parametric fractional imputation may be biased under model misspecification. In this article, we propose a novel semiparametric fractional imputation (SFI) method using Gaussian mixture models. The proposed method is computationally efficient and leads to robust estimation. The proposed method is further extended to incorporate categorical auxiliary information. The asymptotic model consistency and √n-consistency of the SFI estimator are also established. Some simulation studies are presented to check the finite-sample performance of the proposed method. Supplementary materials for this article are available online.
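To make the Gaussian-mixture idea in the fractional-imputation abstract concrete, here is a rough sketch, under stated assumptions, of conditional-mean imputation from a Gaussian mixture fitted to the complete cases. It is not the authors' SFI procedure (which also constructs fractional weights); the function name and settings are illustrative.

```python
# Sketch: impute missing entries by their conditional expectation under a
# Gaussian mixture fitted to complete cases (illustrative, not SFI itself).
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def gmm_impute(X, n_components=2, seed=0):
    X = np.asarray(X, dtype=float)
    complete = ~np.isnan(X).any(axis=1)
    gm = GaussianMixture(n_components, covariance_type="full",
                         random_state=seed).fit(X[complete])
    Xout = X.copy()
    for i in np.where(~complete)[0]:
        o = ~np.isnan(X[i]); m = ~o
        if not o.any():
            continue  # nothing observed: leave the row as NaN
        # component responsibilities given the observed coordinates
        logw = np.log(gm.weights_) + np.array([
            multivariate_normal(gm.means_[k][o],
                                gm.covariances_[k][np.ix_(o, o)]).logpdf(X[i, o])
            for k in range(gm.n_components)])
        w = np.exp(logw - logw.max()); w /= w.sum()
        # responsibility-weighted Gaussian conditional means for missing part
        cond = np.zeros(m.sum())
        for k in range(gm.n_components):
            mu, S = gm.means_[k], gm.covariances_[k]
            cond += w[k] * (mu[m] + S[np.ix_(m, o)] @ np.linalg.solve(
                S[np.ix_(o, o)], X[i, o] - mu[o]))
        Xout[i, m] = cond
    return Xout

rng = np.random.default_rng(0)
Z = rng.multivariate_normal([0, 0, 0], np.eye(3) + 0.5, size=300)
Z[rng.random(Z.shape) < 0.15] = np.nan   # inject toy missingness at random
Zimp = gmm_impute(Z)
```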
Journal: Journal of the American Statistical Association Pages: 654-663 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1796358 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1796358 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:654-663 Template-Type: ReDIF-Article 1.0 Author-Name: Erin E. Gabriel Author-X-Name-First: Erin E. Author-X-Name-Last: Gabriel Author-Name: Michael C. Sachs Author-X-Name-First: Michael C. Author-X-Name-Last: Sachs Author-Name: Arvid Sjölander Author-X-Name-First: Arvid Author-X-Name-Last: Sjölander Title: Causal Bounds for Outcome-Dependent Sampling in Observational Studies Abstract: Outcome-dependent sampling designs are common in many different scientific fields, including epidemiology, ecology, and economics. As with all observational studies, such designs often suffer from unmeasured confounding, which generally precludes the nonparametric identification of causal effects. Nonparametric bounds can provide a way to narrow the range of possible values for a nonidentifiable causal effect without making additional untestable assumptions. The nonparametric bounds literature has almost exclusively focused on settings with random sampling, and the bounds have often been derived with a particular linear programming method. We derive novel bounds for the causal risk difference, often referred to as the average treatment effect, in six settings with outcome-dependent sampling and unmeasured confounding for a binary outcome and exposure. Our derivations of the bounds illustrate two approaches that may be applicable in other settings where the bounding problem cannot be directly stated as a system of linear constraints. We illustrate our derived bounds in a real data example involving the effect of vitamin D concentration on mortality. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 939-950 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1832502 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1832502 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:939-950 Template-Type: ReDIF-Article 1.0 Author-Name: Michele Peruzzi Author-X-Name-First: Michele Author-X-Name-Last: Peruzzi Author-Name: Sudipto Banerjee Author-X-Name-First: Sudipto Author-X-Name-Last: Banerjee Author-Name: Andrew O. Finley Author-X-Name-First: Andrew O. Author-X-Name-Last: Finley Title: Highly Scalable Bayesian Geostatistical Modeling via Meshed Gaussian Processes on Partitioned Domains Abstract: We introduce a class of scalable Bayesian hierarchical models for the analysis of massive geostatistical datasets. The underlying approach combines ideas from high-dimensional geostatistics, partitioning the spatial domain and modeling the regions in the partition using a sparsity-inducing directed acyclic graph (DAG). We extend the model over the DAG to a well-defined spatial process, which we call the meshed Gaussian process (MGP). A major contribution is the development of MGPs on tessellated domains, accompanied by a Gibbs sampler for the efficient recovery of spatial random effects.
In particular, the cubic MGP (Q-MGP) can harness high-performance computing resources by executing all large-scale operations in parallel within the Gibbs sampler, improving mixing and computing time compared to sequential updating schemes. Unlike some existing models for large spatial data, a Q-MGP facilitates massive caching of expensive matrix operations, making it particularly apt in dealing with spatiotemporal remote-sensing data. We compare Q-MGPs against state-of-the-art methods on large synthetic and real-world data. We also illustrate the approach using Normalized Difference Vegetation Index data from the Serengeti park region, recovering latent multivariate spatiotemporal random effects at millions of locations. The source code is available at github.com/mkln/meshgp. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 969-982 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1833889 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1833889 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:969-982 Template-Type: ReDIF-Article 1.0 Author-Name: The Editors Title: Correction Journal: Journal of the American Statistical Association Pages: 1043-1043 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2022.2060607 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2060607 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:1043-1043 Template-Type: ReDIF-Article 1.0 Author-Name: Somabha Mukherjee Author-X-Name-First: Somabha Author-X-Name-Last: Mukherjee Author-Name: Divyansh Agarwal Author-X-Name-First: Divyansh Author-X-Name-Last: Agarwal Author-Name: Nancy R. Zhang Author-X-Name-First: Nancy R. Author-X-Name-Last: Zhang Author-Name: Bhaswar B. Bhattacharya Author-X-Name-First: Bhaswar B. Author-X-Name-Last: Bhattacharya Title: Distribution-Free Multisample Tests Based on Optimal Matchings With Applications to Single Cell Genomics Abstract: In this article, we propose a nonparametric graphical test based on optimal matching for assessing the equality of multiple unknown multivariate probability distributions. Our procedure pools the data from the different classes to create a graph based on the minimum non-bipartite matching, and then utilizes the number of edges connecting data points from different classes to examine the closeness between the distributions. The proposed test is exactly distribution-free (the null distribution does not depend on the distribution of the data) and can be efficiently applied to multivariate as well as non-Euclidean data, whenever the inter-point distances are well-defined. We show that the test is universally consistent, and prove a distributional limit theorem for the test statistic under general alternatives. Through simulation studies, we demonstrate its superior performance against other common and well-known multisample tests. The method is applied to single cell transcriptomics data obtained from the peripheral blood, cancer tissue, and tumor-adjacent normal tissue of human subjects with hepatocellular carcinoma and non-small-cell lung cancer. Our method unveils patterns in how biochemical metabolic pathways are altered across immune cells in a cancer setting, depending on the tissue location. All of the methods described herein are implemented in the R package multicross.
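The test statistic just described (pool the samples, form a minimum non-bipartite matching, count cross-class edges) can be sketched as follows. This is an illustrative toy, not the multicross package; the helper name is hypothetical, and the minimum-weight perfect matching is obtained by negating distances in networkx's maximum-weight matcher.

```python
# Sketch of the cross-class edge count from a minimum non-bipartite matching.
# Few cross-class edges suggest the class distributions differ.
import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

def cross_match_count(X, labels):
    n = len(X)
    assert n % 2 == 0, "a perfect non-bipartite matching needs an even pooled size"
    D = squareform(pdist(X))
    G = nx.Graph()
    for i in range(n):
        for j in range(i + 1, n):
            # negate distances so max-weight matching minimizes total distance
            G.add_edge(i, j, weight=-D[i, j])
    M = nx.max_weight_matching(G, maxcardinality=True)
    return sum(labels[i] != labels[j] for i, j in M)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
print(cross_match_count(X, labels))  # small counts indicate separated distributions
```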
Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 627-638 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1791131 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1791131 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:627-638 Template-Type: ReDIF-Article 1.0 Author-Name: Koen Jochmans Author-X-Name-First: Koen Author-X-Name-Last: Jochmans Title: Heteroscedasticity-Robust Inference in Linear Regression Models With Many Covariates Abstract: We consider inference in linear regression models that is robust to heteroscedasticity and the presence of many control variables. When the number of control variables increases at the same rate as the sample size, the usual heteroscedasticity-robust estimators of the covariance matrix are inconsistent. Hence, tests based on these estimators are size distorted even in large samples. An alternative covariance-matrix estimator for such a setting is presented that complements recent work by Cattaneo, Jansson, and Newey. We provide high-level conditions for our approach to deliver (asymptotically) size-correct inference as well as more primitive conditions for three special cases. Simulation results and an empirical illustration to inference on the union premium are also provided. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 887-896 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1831924 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1831924 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:887-896 Template-Type: ReDIF-Article 1.0 Author-Name: Yang Ni Author-X-Name-First: Yang Author-X-Name-Last: Ni Title: Bayesian Thinking in Biostatistics. Journal: Journal of the American Statistical Association Pages: 1041-1042 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2022.2069442 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2069442 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:1041-1042 Template-Type: ReDIF-Article 1.0 Author-Name: Gery Geenens Author-X-Name-First: Gery Author-X-Name-Last: Geenens Author-Name: Pierre Lafaye de Micheaux Author-X-Name-First: Pierre Author-X-Name-Last: Lafaye de Micheaux Title: The Hellinger Correlation Abstract: In this article, the defining properties of any valid measure of the dependence between two continuous random variables are revisited and complemented with two original ones, shown to imply other usual postulates. While other popular choices are proved to violate some of these requirements, a class of dependence measures satisfying all of them is identified. One particular measure, that we call the Hellinger correlation, appears as a natural choice within that class due to both its theoretical and intuitive appeal. A simple and efficient nonparametric estimator for that quantity is proposed, with its implementation publicly available in the R package HellCor. Synthetic and real-data examples illustrate the descriptive ability of the measure, which can also be used as a test statistic for exact independence testing. Supplementary materials for this article are available online.
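As background for the measure just described, here is a toy numerical illustration of its building block, the squared Hellinger distance H²(f,g) = 1 − ∫√(fg). This is not the HellCor estimator, which works nonparametrically with copula densities; the densities below are known parametric ones chosen so the grid result can be checked against a closed form.

```python
# Squared Hellinger distance between two unit-variance normals, computed on
# a grid and checked against the closed form 1 - exp(-(mu1 - mu2)^2 / 8).
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]
f = norm.pdf(x, loc=0.0, scale=1.0)
g = norm.pdf(x, loc=1.0, scale=1.0)
bc = np.sum(np.sqrt(f * g)) * dx     # Bhattacharyya coefficient, int sqrt(f g)
h2 = 1.0 - bc                        # squared Hellinger distance
print(h2, 1 - np.exp(-1.0 / 8))      # grid value vs. closed form (~0.1175)
```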
Journal: Journal of the American Statistical Association Pages: 639-653 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1791132 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1791132 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:639-653 Template-Type: ReDIF-Article 1.0 Author-Name: Xuan Bi Author-X-Name-First: Xuan Author-X-Name-Last: Bi Author-Name: Long Feng Author-X-Name-First: Long Author-X-Name-Last: Feng Author-Name: Cai Li Author-X-Name-First: Cai Author-X-Name-Last: Li Author-Name: Heping Zhang Author-X-Name-First: Heping Author-X-Name-Last: Zhang Title: Modeling Pregnancy Outcomes Through Sequentially Nested Regression Models Abstract: Polycystic ovary syndrome (PCOS) is one of the most common causes of infertility among women of reproductive age. Unfortunately, the etiology of PCOS is poorly understood. Large-scale clinical trials for pregnancy in polycystic ovary syndrome (PPCOS) were conducted to evaluate the effectiveness of treatments. Ovulation, pregnancy, and live birth are three sequentially nested binary outcomes, typically analyzed separately. However, the separate models may lose power in detecting the treatment effects and influential variables for live birth, due to decreased sample sizes and unbalanced event counts. It has been a long-held hypothesis among clinicians that some of the important variables for early pregnancy outcomes may continue their influence on live birth. To consider this possibility, we develop an ℓ0-norm-based regularization method in favor of variables that have been identified from an earlier stage. Our approach explicitly bridges the connections across nested outcomes through computationally easy algorithms and enjoys theoretical guarantees for estimation and variable selection. By analyzing the PPCOS data, we successfully uncover the hidden influence of risk factors on live birth, which confirms clinical experience. Moreover, we provide novel infertility treatment recommendations (e.g., letrozole vs. clomiphene citrate) for women with PCOS to improve their chances of live birth. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 602-616 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2021.2006666 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2006666 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:602-616 Template-Type: ReDIF-Article 1.0 Author-Name: M. Hallin Author-X-Name-First: M. Author-X-Name-Last: Hallin Author-Name: D. La Vecchia Author-X-Name-First: D. Author-X-Name-Last: La Vecchia Author-Name: H. Liu Author-X-Name-First: H. Author-X-Name-Last: Liu Title: Center-Outward R-Estimation for Semiparametric VARMA Models Abstract: We propose a new class of R-estimators for semiparametric VARMA models in which the innovation density plays the role of the nuisance parameter. Our estimators are based on the novel concepts of multivariate center-outward ranks and signs. We show that these concepts, combined with Le Cam’s asymptotic theory of statistical experiments, yield a class of semiparametric estimation procedures, which are efficient (at a given reference density), root-n consistent, and asymptotically normal under a broad class of (possibly non-elliptical) actual innovation densities.
No kernel density estimation is required to implement our procedures. A Monte Carlo comparative study of our R-estimators and other routinely applied competitors demonstrates the benefits of the novel methodology, in large and small samples. Proofs, computational aspects, and further numerical results are available in the supplementary materials. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 925-938 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1832501 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1832501 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:925-938 Template-Type: ReDIF-Article 1.0 Author-Name: Luella Fu Author-X-Name-First: Luella Author-X-Name-Last: Fu Author-Name: Bowen Gang Author-X-Name-First: Bowen Author-X-Name-Last: Gang Author-Name: Gareth M. James Author-X-Name-First: Gareth M. Author-X-Name-Last: James Author-Name: Wenguang Sun Author-X-Name-First: Wenguang Author-X-Name-Last: Sun Title: Heteroscedasticity-Adjusted Ranking and Thresholding for Large-Scale Multiple Testing Abstract: Standardization has been a widely adopted practice in multiple testing, for it takes into account the variability in sampling and makes the test statistics comparable across different study units. However, despite conventional wisdom to the contrary, we show that there can be a significant loss in information from basing hypothesis tests on standardized statistics rather than the full data. We develop a new class of heteroscedasticity-adjusted ranking and thresholding (HART) rules that aim to improve existing methods by simultaneously exploiting commonalities and adjusting heterogeneities among the study units. The main idea of HART is to bypass standardization by directly incorporating both the summary statistic and its variance into the testing procedure. A key message is that the variance structure of the alternative distribution, which is subsumed under standardized statistics, is highly informative and can be exploited to achieve higher power. The proposed HART procedure is shown to be asymptotically valid and optimal for false discovery rate (FDR) control. Our simulation results demonstrate that HART achieves substantial power gain over existing methods at the same FDR level. We illustrate the implementation through a microarray analysis of myeloma. Journal: Journal of the American Statistical Association Pages: 1028-1040 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1840992 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1840992 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:1028-1040 Template-Type: ReDIF-Article 1.0 Author-Name: Zilin Li Author-X-Name-First: Zilin Author-X-Name-Last: Li Author-Name: Yaowu Liu Author-X-Name-First: Yaowu Author-X-Name-Last: Liu Author-Name: Xihong Lin Author-X-Name-First: Xihong Author-X-Name-Last: Lin Title: Simultaneous Detection of Signal Regions Using Quadratic Scan Statistics With Applications to Whole Genome Association Studies Abstract: In this article, we consider the detection of signal regions associated with disease outcomes in whole genome association studies.
Gene- or region-based methods have become increasingly popular in whole genome association analysis as a complementary approach to traditional individual variant analysis. However, these methods test for the association between an outcome and the genetic variants in a prespecified region, for example, a gene. In view of massive intergenic regions in whole genome sequencing (WGS) studies, we propose a computationally efficient quadratic scan (Q-SCAN) statistic-based method to detect the existence and the locations of signal regions by scanning the genome continuously. The proposed method accounts for the correlation (linkage disequilibrium) among genetic variants, and allows for signal regions to have both causal and neutral variants, and the effects of signal variants to be in different directions. We study the asymptotic properties of the proposed Q-SCAN statistics. We derive an empirical threshold that controls the family-wise error rate, and show that under regularity conditions the proposed method consistently selects the true signal regions. We perform simulation studies to evaluate the finite-sample performance of the proposed method. Our simulation results show that the proposed procedure outperforms the existing methods, especially when signal regions have causal variants whose effects are in different directions, or are contaminated with neutral variants. We illustrate Q-SCAN by analyzing the WGS data from the Atherosclerosis Risk in Communities study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 823-834 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1822849 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1822849 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:823-834 Template-Type: ReDIF-Article 1.0 Author-Name: Eftychia Solea Author-X-Name-First: Eftychia Author-X-Name-Last: Solea Author-Name: Bing Li Author-X-Name-First: Bing Author-X-Name-Last: Li Title: Copula Gaussian Graphical Models for Functional Data Abstract: We introduce a statistical graphical model for multivariate functional data, which are common in medical applications such as EEG and fMRI. Recently published functional graphical models rely on the multivariate Gaussian process assumption, but we relax it by introducing the functional copula Gaussian graphical model (FCGGM). This model removes the marginal Gaussian assumption but retains the simplicity of the Gaussian dependence structure, which is particularly attractive for large data. We develop four estimators for the FCGGM and establish the consistency and the convergence rates of one of them. We compare our FCGGM with the existing functional Gaussian graphical model by simulations, and apply our method to an EEG dataset to construct brain networks. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 781-793 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1817750 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1817750 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:781-793 Template-Type: ReDIF-Article 1.0 Author-Name: Luca Frigau Author-X-Name-First: Luca Author-X-Name-Last: Frigau Author-Name: Qiuyi Wu Author-X-Name-First: Qiuyi Author-X-Name-Last: Wu Author-Name: David Banks Author-X-Name-First: David Author-X-Name-Last: Banks Title: Optimizing the JSM Program Abstract: Sometimes the Joint Statistical Meetings (JSM) is frustrating to attend, because multiple sessions on the same topic are scheduled at the same time. This article uses seeded latent Dirichlet allocation and a scheduling optimization algorithm to substantially reduce overlapping content in the original schedule for the 2020 JSM program. Specifically, a measure based on total variation distance that ranges from 0 (random scheduling) to 1 (no overlapping content) finds that the original schedule had a score of 0.058, whereas our proposed schedule achieved a score of 0.371. This is a substantial improvement that would (i) increase participant satisfaction as measured by the post-JSM satisfaction survey, and (ii) save the American Statistical Association significant money by obviating the need for the traditional in-person meeting of the 47 program chairs and other organizers. The methodology developed in this work immediately applies to future JSMs and is easily modified to improve scheduling for any other scientific conference that has parallel sessions. Journal: Journal of the American Statistical Association Pages: 617-626 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2021.1978466 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1978466 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:617-626 Template-Type: ReDIF-Article 1.0 Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Jianhua Guo Author-X-Name-First: Jianhua Author-X-Name-Last: Guo Author-Name: Shurong Zheng Author-X-Name-First: Shurong Author-X-Name-Last: Zheng Title: Estimating Number of Factors by Adjusted Eigenvalues Thresholding Abstract: Determining the number of common factors is an important and practical topic in high-dimensional factor models. The existing literature is mainly based on the eigenvalues of the covariance matrix. Owing to the incomparability of the eigenvalues of the covariance matrix caused by the heterogeneous scales of the observed variables, it is not easy to find an accurate relationship between these eigenvalues and the number of common factors. To overcome this limitation, we appeal to the correlation matrix and demonstrate, surprisingly, that the number of eigenvalues greater than 1 of the population correlation matrix is the same as the number of common factors under certain mild conditions. To exploit this relationship, we study random matrix theory based on the sample correlation matrix to correct biases in estimating the top eigenvalues and to take into account estimation errors in eigenvalue estimation. Thus, we propose a tuning-free scale-invariant adjusted correlation thresholding (ACT) method for determining the number of common factors in high-dimensional factor models, taking into account the sampling variabilities and biases of top sample eigenvalues. We also establish the optimality of the proposed ACT method in terms of minimal signal strength and the optimal threshold.
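A toy numerical check, under stated assumptions, of the population identity the ACT abstract describes; the random matrix theory bias correction that distinguishes ACT from this naive count is omitted, and all settings are illustrative.

```python
# Naive check (not the ACT implementation): in a factor model, the number of
# eigenvalues of the correlation matrix exceeding 1 matches the number of
# common factors. ACT additionally corrects the top sample eigenvalues for
# finite-sample bias; this sketch counts raw sample eigenvalues.
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 1000, 50, 3                   # observations, variables, true factors
F = rng.normal(size=(n, k))             # common factors
L = rng.normal(size=(p, k))             # factor loadings
X = F @ L.T + rng.normal(size=(n, p))   # factor model with unit-variance noise
R = np.corrcoef(X, rowvar=False)        # sample correlation matrix
eigs = np.linalg.eigvalsh(R)
print((eigs > 1).sum(), "estimated factors; truth:", k)
```

With weaker signals or larger p/n, the raw count can miss or overshoot, which is exactly the regime where the abstract's adjusted thresholding matters.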
Simulation studies lend further support to our proposed method and show that our estimator outperforms competing methods in most test cases. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 852-861 Issue: 538 Volume: 117 Year: 2022 Month: 4 X-DOI: 10.1080/01621459.2020.1825448 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1825448 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:852-861 Template-Type: ReDIF-Article 1.0 Author-Name: Marianna Pensky Author-X-Name-First: Marianna Author-X-Name-Last: Pensky Title: Discussion of “Confidence Intervals for Nonparametric Empirical Bayes Analysis” Journal: Journal of the American Statistical Association Pages: 1183-1185 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2022.2096039 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2096039 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1183-1185 Template-Type: ReDIF-Article 1.0 Author-Name: Chenguang Dai Author-X-Name-First: Chenguang Author-X-Name-Last: Dai Author-Name: Jeremy Heng Author-X-Name-First: Jeremy Author-X-Name-Last: Heng Author-Name: Pierre E. Jacob Author-X-Name-First: Pierre E. Author-X-Name-Last: Jacob Author-Name: Nick Whiteley Author-X-Name-First: Nick Author-X-Name-Last: Whiteley Title: An Invitation to Sequential Monte Carlo Samplers Abstract: Statisticians often use Monte Carlo methods to approximate probability distributions, primarily with Markov chain Monte Carlo and importance sampling. Sequential Monte Carlo samplers are a class of algorithms that combine both techniques to approximate distributions of interest and their normalizing constants. These samplers originate from particle filtering for state space models and have become general and scalable sampling techniques. This article describes sequential Monte Carlo samplers and their possible implementations, arguing that they remain under-used in statistics, despite their ability to perform sequential inference and to leverage parallel processing resources, among other potential benefits. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1587-1600 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2022.2087659 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2087659 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1587-1600 Template-Type: ReDIF-Article 1.0 Author-Name: Belmiro P. M. Duarte Author-X-Name-First: Belmiro P. M. Author-X-Name-Last: Duarte Author-Name: Anthony C. Atkinson Author-X-Name-First: Anthony C. Author-X-Name-Last: Atkinson Author-Name: José F. O. Granjo Author-X-Name-First: José F. O. Author-X-Name-Last: Granjo Author-Name: Nuno M. C. Oliveira Author-X-Name-First: Nuno M. C.
Author-X-Name-Last: Oliveira Title: Optimal Design of Experiments for Implicit Models Abstract: Explicit models representing the response variables as functions of the control variables are standard in virtually all scientific fields. For these models, there is a vast literature on the optimal design of experiments (ODoE) to provide good estimates of the parameters with the use of minimal resources. In contrast, the ODoE for implicit models is more complex and has not been systematically addressed. Nevertheless, there are practical examples where the models relating the response variables, the parameters, and the factors are implicit or hard to convert into an explicit form. We propose a general formulation for developing the theory of the ODoE for implicit algebraic models, specifically to find continuous local designs. The treatment relies on converting the ODoE problem into an optimization problem of the nonlinear programming (NLP) class, which includes the construction of the parameter sensitivities and the Cholesky decomposition of the Fisher information matrix. The NLP problem generated has multiple local optima, and we use global solvers, combined with an equivalence theorem from the theory of ODoE, to ensure the global optimality of our continuous optimal designs. We consider D- and A-optimality criteria and apply the approach to five examples of practical interest in chemistry and thermodynamics. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1424-1437 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1862670 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1862670 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1424-1437 Template-Type: ReDIF-Article 1.0 Author-Name: Peng Shi Author-X-Name-First: Peng Author-X-Name-Last: Shi Author-Name: Gee Y. Lee Author-X-Name-First: Gee Y. Author-X-Name-Last: Lee Title: Copula Regression for Compound Distributions with Endogenous Covariates with Applications in Insurance Deductible Pricing Abstract: This article concerns deductible pricing in nonlife insurance contracts. The primary interest of insurers is the effect of the contract deductible on a policyholder’s aggregate loss, which is determined by a compound distribution where the sum of individual claim amounts is stopped by the number of claims. Policyholders choose the deductible level based on their hidden risks, which makes the deductible endogenous in the regressions for both claim frequency and claim severity. To address the endogeneity in the regression for the compound aggregate loss, we introduce a novel approach using pair copula constructions to jointly model the policyholder’s deductible, number of claims, and individual claim amounts, in the context of compound distributions. The proposed method provides insurers an empirical tool to uncover the underlying risk distribution of the potential customers. In the application, we consider an insurance portfolio from the property insurance program that provides property coverage for building and contents of local government entities in Wisconsin. Using historical data on policyholders and insurance claims, we first provide empirical evidence of the endogeneity of the deductible.
Interestingly, we find that the policyholder’s deductible is negatively associated with the claim frequency but positively associated with the claim severity. For the portfolio of policyholders, the endogenous deductible model provides superior prediction for 65% and 71% of policyholders for claim frequency and severity, respectively. The endogeneity of the deductible has significant managerial implications for insurance operations. In particular, the risk score suggested by the proposed method allows the insurer to identify additional profitable underwriting strategies, which are quantified by Gini indices of 0.22 and 0.13 when switching from the exogenous-deductible premium and from the insurer’s contract premium, respectively. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1094-1109 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2022.2040519 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2040519 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1094-1109 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2093726_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Subhashis Ghosal Author-X-Name-First: Subhashis Author-X-Name-Last: Ghosal Title: Discussion of “Confidence Intervals for Nonparametric Empirical Bayes Analysis” by Ignatiadis and Wager Journal: Journal of the American Statistical Association Pages: 1171-1174 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2022.2093726 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2093726 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1171-1174 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2008403_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Nikolaos Ignatiadis Author-X-Name-First: Nikolaos Author-X-Name-Last: Ignatiadis Author-Name: Stefan Wager Author-X-Name-First: Stefan Author-X-Name-Last: Wager Title: Confidence Intervals for Nonparametric Empirical Bayes Analysis Abstract: In an empirical Bayes analysis, we use data from repeated sampling to imitate inferences made by an oracle Bayesian with extensive knowledge of the data-generating distribution. Existing results provide a comprehensive characterization of when and why empirical Bayes point estimates accurately recover oracle Bayes behavior. In this paper, we develop flexible and practical confidence intervals that provide asymptotic frequentist coverage of empirical Bayes estimands, such as the posterior mean or the local false sign rate. The coverage statements hold even when the estimands are only partially identified or when empirical Bayes point estimates converge very slowly. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1149-1166 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2021.2008403 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2008403 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
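Illustrative aside: the posterior mean mentioned in the Ignatiadis-Wager abstract is the classical empirical Bayes estimand. A minimal plug-in point estimate (not the paper's confidence intervals) can be computed with Tweedie's formula and a kernel density estimate, assuming Gaussian noise and synthetic data.

```python
# Tweedie's formula in the Gaussian sequence model y_i ~ N(mu_i, 1):
# E[mu | y] = y + d/dy log f(y), where f is the marginal density of y.
# We estimate f by a kernel density estimate and differentiate
# numerically. Point estimation only; no interval construction here.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
mu = rng.normal(0, 2, size=5000)   # latent means (synthetic)
y = mu + rng.normal(size=5000)     # observed data

f = gaussian_kde(y)
eps = 1e-3
grid = np.linspace(-6, 6, 241)
score = (np.log(f(grid + eps)) - np.log(f(grid - eps))) / (2 * eps)
posterior_mean = grid + score      # Tweedie plug-in estimate

# For this normal prior the oracle is E[mu | y] = 0.8 * y.
print(np.c_[grid[::60], posterior_mean[::60], 0.8 * grid[::60]])
```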
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1149-1166 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1858838_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Likun Zhang Author-X-Name-First: Likun Author-X-Name-Last: Zhang Author-Name: Benjamin A. Shaby Author-X-Name-First: Benjamin A. Author-X-Name-Last: Shaby Author-Name: Jennifer L. Wadsworth Author-X-Name-First: Jennifer L. Author-X-Name-Last: Wadsworth Title: Hierarchical Transformed Scale Mixtures for Flexible Modeling of Spatial Extremes on Datasets With Many Locations Abstract: Flexible spatial models that allow transitions between tail dependence classes have recently appeared in the literature. However, inference for these models is computationally prohibitive, even in moderate dimensions, due to the necessity of repeatedly evaluating the multivariate Gaussian distribution function. In this work, we attempt to achieve truly high-dimensional inference for extremes of spatial processes, while retaining the desirable flexibility in the tail dependence structure, by modifying an established class of models based on scale mixtures of Gaussian processes. We show that the desired extremal dependence properties from the original models are preserved under the modification, and demonstrate that the corresponding Bayesian hierarchical model does not involve the expensive computation of the multivariate Gaussian distribution function. We fit our model to exceedances of a high threshold, and perform coverage analyses and cross-model checks to validate its ability to capture different types of tail characteristics. We use a standard adaptive Metropolis algorithm for model fitting, and further accelerate the computation via parallelization and Rcpp. Lastly, we apply the model to a dataset of a fire threat index for the Great Plains region of the United States, which is vulnerable to massively destructive wildfires. We find that the joint tail of the fire threat index exhibits a decaying dependence structure that cannot be captured by limiting extreme value models. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1357-1369 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1858838 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1858838 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1357-1369 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2101797_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Yang Zhou Author-X-Name-First: Yang Author-X-Name-Last: Zhou Author-Name: Lirong Xue Author-X-Name-First: Lirong Author-X-Name-Last: Xue Author-Name: Zhengyu Shi Author-X-Name-First: Zhengyu Author-X-Name-Last: Shi Author-Name: Libo Wu Author-X-Name-First: Libo Author-X-Name-Last: Wu Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Title: Rejoinder Journal: Journal of the American Statistical Association Pages: 1066-1067 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2022.2101797 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2101797 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
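Illustrative aside: the model class modified by Zhang, Shaby, and Wadsworth above builds a spatial process as a scale mixture of a Gaussian process, X(s) = R * W(s), and the law of the scale R governs the tail dependence class. A minimal synthetic sketch (one-dimensional domain, exponential covariance, Pareto scale; not the authors' hierarchical model) shows how the common scale induces joint threshold exceedances.

```python
# Simulate a Gaussian scale mixture X = R * W and compare the joint
# exceedance rate at two sites with the independence baseline.
import numpy as np

rng = np.random.default_rng(2)
s = np.linspace(0, 10, 200)                       # spatial locations
cov = np.exp(-np.abs(s[:, None] - s[None, :]))    # exponential covariance
L = np.linalg.cholesky(cov + 1e-8 * np.eye(len(s)))

n_rep = 1000
W = (L @ rng.standard_normal((len(s), n_rep))).T  # Gaussian replicates
R = rng.pareto(3.0, size=n_rep) + 1.0             # heavy-tailed scale
X = R[:, None] * W                                # scale-mixture process

u = np.quantile(X, 0.95)                          # high threshold
joint = np.mean((X[:, 0] > u) & (X[:, 50] > u))
indep = np.mean(X[:, 0] > u) * np.mean(X[:, 50] > u)
print("joint exceedance:", joint, "independence baseline:", indep)
```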
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1066-1067 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1859379_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: T. Tony Cai Author-X-Name-First: T. Tony Author-X-Name-Last: Cai Author-Name: Wenguang Sun Author-X-Name-First: Wenguang Author-X-Name-Last: Sun Author-Name: Yin Xia Author-X-Name-First: Yin Author-X-Name-Last: Xia Title: LAWS: A Locally Adaptive Weighting and Screening Approach to Spatial Multiple Testing Abstract: Exploiting spatial patterns in large-scale multiple testing promises to improve both power and interpretability of false discovery rate (FDR) analyses. This article develops a new class of locally adaptive weighting and screening (LAWS) rules that directly incorporates useful local patterns into inference. The idea involves constructing robust and structure-adaptive weights according to the estimated local sparsity levels. LAWS provides a unified framework for a broad range of spatial problems and is fully data-driven. It is shown that LAWS controls the FDR asymptotically under mild conditions on dependence. The finite sample performance is investigated using simulated data, which demonstrates that LAWS controls the FDR and outperforms existing methods in power. The efficiency gain is substantial in many settings. We further illustrate the merits of LAWS through applications to the analysis of two-dimensional and three-dimensional images. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1370-1383 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1859379 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1859379 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1370-1383 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1865168_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Ted Westling Author-X-Name-First: Ted Author-X-Name-Last: Westling Title: Nonparametric Tests of the Causal Null With Nondiscrete Exposures Abstract: In many scientific studies, it is of interest to determine whether an exposure has a causal effect on an outcome. In observational studies, this is a challenging task due to the presence of confounding variables that affect both the exposure and the outcome. Many methods have been developed to test for the presence of a causal effect when all such confounding variables are observed and when the exposure of interest is discrete. In this article, we propose a class of nonparametric tests of the null hypothesis that there is no average causal effect of an arbitrary univariate exposure on an outcome in the presence of observed confounding. Our tests apply to discrete, continuous, and mixed discrete-continuous exposures. We demonstrate that our proposed tests are doubly robust and consistent, that they have correct asymptotic Type I error if both nuisance parameters involved in the problem are estimated at fast enough rates, and that they have power to detect local alternatives approaching the null at the rate n^{-1/2}. We study the performance of our tests in numerical studies, and use them to test for the presence of a causal effect of BMI on immune response in early-phase vaccine trials. Supplementary materials for this article are available online.
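Illustrative aside: the doubly robust structure behind tests of the causal null can be sketched with the standard augmented inverse-probability-weighted (AIPW) influence function, here specialized to a binary exposure for brevity (the Westling article itself covers general univariate exposures). Data and nuisance models below are synthetic stand-ins.

```python
# AIPW-style z-test of "no average causal effect" for a binary exposure.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(3)
n = 4000
X = rng.normal(size=(n, 2))                 # observed confounders
p = 1 / (1 + np.exp(-X[:, 0]))              # true propensity
A = rng.binomial(1, p)
Y = X[:, 0] + rng.normal(size=n)            # causal null holds

pi = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]
mu1 = LinearRegression().fit(X[A == 1], Y[A == 1]).predict(X)
mu0 = LinearRegression().fit(X[A == 0], Y[A == 0]).predict(X)

# Influence-function values; mean zero under the causal null.
phi = (mu1 - mu0
       + A * (Y - mu1) / pi
       - (1 - A) * (Y - mu0) / (1 - pi))
t = np.sqrt(n) * phi.mean() / phi.std()
print("z-statistic for H0 of no average causal effect:", round(t, 2))
```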
Journal: Journal of the American Statistical Association Pages: 1551-1562 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1865168 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1865168 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1551-1562 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1855183_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Di Wang Author-X-Name-First: Di Author-X-Name-Last: Wang Author-Name: Yao Zheng Author-X-Name-First: Yao Author-X-Name-Last: Zheng Author-Name: Heng Lian Author-X-Name-First: Heng Author-X-Name-Last: Lian Author-Name: Guodong Li Author-X-Name-First: Guodong Author-X-Name-Last: Li Title: High-Dimensional Vector Autoregressive Time Series Modeling via Tensor Decomposition Abstract: The classical vector autoregressive model is a fundamental tool for multivariate time series analysis. However, it involves too many parameters when the number of time series and lag order are even moderately large. This article proposes to rearrange the transition matrices of the model into a tensor form such that the parameter space can be restricted along three directions simultaneously via tensor decomposition. In contrast, the reduced-rank regression method can restrict the parameter space in only one direction. Besides achieving substantial dimension reduction, the proposed model is interpretable from the factor modeling perspective. Moreover, to handle high-dimensional time series, this article considers imposing sparsity on factor matrices to improve the model interpretability and estimation efficiency, which leads to a sparsity-inducing estimator. For the low-dimensional case, we derive asymptotic properties of the proposed least squares estimator and introduce an alternating least squares algorithm. For the high-dimensional case, we establish nonasymptotic properties of the sparsity-inducing estimator and propose an ADMM algorithm for regularized estimation. Simulation experiments and a real data example demonstrate the advantages of the proposed approach over various existing methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1338-1356 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1855183 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1855183 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1338-1356 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2053136_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Xu Guo Author-X-Name-First: Xu Author-X-Name-Last: Guo Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Author-Name: Jingyuan Liu Author-X-Name-First: Jingyuan Author-X-Name-Last: Liu Author-Name: Mudong Zeng Author-X-Name-First: Mudong Author-X-Name-Last: Zeng Title: High-Dimensional Mediation Analysis for Selecting DNA Methylation Loci Mediating Childhood Trauma and Cortisol Stress Reactivity Abstract: Childhood trauma tends to influence cortisol stress reactivity through the mediating effects of DNA methylation. Houtepen et al. conducted a study to investigate the role of DNA methylation in cortisol stress reactivity and its relationship with childhood trauma. 
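Illustrative aside: the rearrangement in the Wang, Zheng, Lian, and Li abstract above stacks the VAR(P) transition matrices A_1, ..., A_P into an N x N x P tensor whose multilinear ranks can be restricted in three directions at once. A minimal sketch (dimensions and the low-rank construction are invented; this is a plain HOSVD-style rank check, not the authors' estimation algorithm):

```python
# Stack VAR transition matrices into a tensor and inspect the singular
# values of its three mode unfoldings.
import numpy as np

rng = np.random.default_rng(4)
N, P, r = 10, 4, 2
U = np.linalg.qr(rng.normal(size=(N, r)))[0]   # shared row space
V = np.linalg.qr(rng.normal(size=(N, r)))[0]   # shared column space
A = np.stack([0.3 * U @ np.diag(rng.normal(size=r)) @ V.T
              for _ in range(P)], axis=2)       # N x N x P tensor

for mode in range(3):
    unfold = np.moveaxis(A, mode, 0).reshape(A.shape[mode], -1)
    sv = np.linalg.svd(unfold, compute_uv=False)
    print(f"mode-{mode + 1} singular values:", np.round(sv[:4], 3))
```

The first two unfoldings have (numerical) rank r = 2, which is the kind of low multilinear rank the tensor decomposition exploits.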
The study collected a dataset consisting of 385,882 DNA methylation loci, cortisol stress reactivity, a one-dimensional score on a childhood trauma questionnaire, and several covariates for 85 healthy individuals. Of great scientific interest is identifying the active mediating loci among the 385,882 candidates. Houtepen et al. conducted 385,882 linear mediation analyses, in each of which one locus was considered, and identified three active mediating loci. More recently, van Kesteren and Oberski proposed a coordinate-wise mediation filter (CMF) and applied it to the same dataset. They identified five active mediating loci. Unfortunately, the three loci identified by Houtepen et al. are completely different from the five loci identified by van Kesteren and Oberski, probably because neither Houtepen et al. nor van Kesteren and Oberski considered all loci jointly in their analyses. The high-dimensional DNA methylation loci indeed necessitate new techniques for identifying active mediating loci and testing the direct and indirect effects of early-life traumatic stress on later cortisol alteration. Motivated by the contradictory results in the aforementioned two scientific works, we develop a new estimating and testing procedure, and apply it to the same dataset as that analyzed by the two works. We identify three new loci: cg19230917, cg06422529 and cg03199124, and their effect sizes and p-values are 321.196 (p-value = 0.035965), 418.173 (p-value = 0.000234) and 471.865 (p-value = 0.001691), respectively. These three loci possess both reasonable neurobiological interpretations and statistically significant effects via our proposed tests. Based on our new procedure, we further confirm that childhood trauma does not have significant direct effects on cortisol change; it only indirectly affects cortisol through DNA methylation, and the indirect effect is negative. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1110-1121 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2022.2053136 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2053136 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1110-1121 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2093727_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Dongyue Xie Author-X-Name-First: Dongyue Author-X-Name-Last: Xie Author-Name: Matthew Stephens Author-X-Name-First: Matthew Author-X-Name-Last: Stephens Title: Discussion of “Confidence Intervals for Nonparametric Empirical Bayes Analysis” Journal: Journal of the American Statistical Association Pages: 1186-1191 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2022.2093727 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2093727 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
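Illustrative aside: the single-locus linear mediation analyses that the Guo et al. abstract above attributes to Houtepen et al. follow the classical product-of-coefficients scheme. A minimal sketch on synthetic data (one exposure, one mediator, Sobel standard error; not the authors' joint high-dimensional procedure):

```python
# One single-locus mediation fit: trauma -> methylation -> cortisol.
# Indirect effect = a * b, with a Sobel standard error.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 85                                     # sample size as in the study
trauma = rng.normal(size=n)                # exposure (synthetic)
methyl = 0.5 * trauma + rng.normal(size=n)             # one CpG locus
cortisol = -0.4 * methyl + rng.normal(size=n)          # no direct effect

m_fit = sm.OLS(methyl, sm.add_constant(trauma)).fit()
y_fit = sm.OLS(cortisol, sm.add_constant(np.c_[methyl, trauma])).fit()

a, b = m_fit.params[1], y_fit.params[1]
se = np.sqrt(a**2 * y_fit.bse[1]**2 + b**2 * m_fit.bse[1]**2)
print("indirect effect a*b =", round(a * b, 3),
      "Sobel z =", round(a * b / se, 2))
```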
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1186-1191 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1851236_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Anders Bredahl Kock Author-X-Name-First: Anders Bredahl Author-X-Name-Last: Kock Author-Name: David Preinerstorfer Author-X-Name-First: David Author-X-Name-Last: Preinerstorfer Author-Name: Bezirgen Veliyev Author-X-Name-First: Bezirgen Author-X-Name-Last: Veliyev Title: Functional Sequential Treatment Allocation Abstract: Consider a setting in which a policy maker assigns subjects to treatments, observing each outcome before the next subject arrives. Initially, it is unknown which treatment is best, but the sequential nature of the problem permits learning about the effectiveness of the treatments. While the multi-armed-bandit literature has shed much light on the situation in which the policy maker compares the effectiveness of the treatments through their mean, much less is known about other targets. This is restrictive, because a cautious decision maker may prefer to target a robust location measure such as a quantile or a trimmed mean. Furthermore, socio-economic decision making often requires targeting purpose-specific characteristics of the outcome distribution, such as its inherent degree of inequality, welfare, or poverty. In the present article, we introduce and study sequential learning algorithms when the distributional characteristic of interest is a general functional of the outcome distribution. Minimax expected regret optimality results are obtained within the subclass of explore-then-commit policies, and for the unrestricted class of all policies. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1311-1323 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1851236 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1851236 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1311-1323 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1844719_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Ben Dai Author-X-Name-First: Ben Author-X-Name-Last: Dai Author-Name: Xiaotong Shen Author-X-Name-First: Xiaotong Author-X-Name-Last: Shen Author-Name: Wing Wong Author-X-Name-First: Wing Author-X-Name-Last: Wong Title: Coupled Generation Abstract: Instance generation creates representative examples to interpret a learning model, as in regression and classification. For example, representative sentences of a topic of interest describe the topic specifically for sentence categorization. In such a situation, a large number of unlabeled observations may be available in addition to labeled data, for example, many unclassified text corpora (unlabeled instances) are available with only a few classified sentences (labeled instances). In this article, we introduce a novel generative method, called a coupled generator, producing instances given a specific learning outcome, based on indirect and direct generators. The indirect generator uses the inverse principle to yield the corresponding inverse probability, making it possible to generate instances by leveraging unlabeled data. The direct generator learns the distribution of an instance given its learning outcome.
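Illustrative aside: the explore-then-commit policies analyzed by Kock, Preinerstorfer, and Veliyev above are easy to sketch for a distributional target other than the mean. Here the target is the 20% quantile of the outcome; arms, horizon, and exploration budget are invented for illustration.

```python
# Explore-then-commit targeting a quantile rather than the mean.
import numpy as np

rng = np.random.default_rng(6)
arms = [lambda s: rng.normal(0.0, 1.0, s),        # symmetric outcomes
        lambda s: rng.exponential(1.0, s) - 0.6]  # heavy right tail
tau, n_explore, horizon = 0.2, 200, 5000

# Explore: sample each arm n_explore times, estimate the target quantile.
est = [np.quantile(draw(n_explore), tau) for draw in arms]
best = int(np.argmax(est))

# Commit: play the empirically best arm for the remaining rounds.
committed = arms[best](horizon - n_explore * len(arms))
print("chosen arm:", best,
      "estimated tau-quantiles:", np.round(est, 3),
      "realized tau-quantile:", round(np.quantile(committed, tau), 3))
```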
Then, the coupled generator seeks the best one from the indirect and direct generators, which is designed to enjoy the benefits of both and deliver higher generation accuracy. For sentence generation given a topic, we develop an embedding-based regression/classification in conjunction with an unconditional recurrent neural network for the indirect generator, whereas a conditional recurrent neural network is natural for the corresponding direct generator. Moreover, we derive finite-sample generation error bounds for the indirect and direct generators to reveal the generative aspects of both methods, thus explaining the benefits of the coupled generator. Finally, we apply the proposed methods to a real benchmark of abstract classification and demonstrate that the coupled generator composes reasonably good sentences from a dictionary to describe a specific topic of interest. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1243-1253 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1844719 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1844719 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1243-1253 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2098134_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Hyunwoo Park Author-X-Name-First: Hyunwoo Author-X-Name-Last: Park Title: A History of Data Visualization and Graphic Communication Journal: Journal of the American Statistical Association Pages: 1601-1603 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2022.2098134 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2098134 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1601-1603 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1862668_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Zhaoxing Gao Author-X-Name-First: Zhaoxing Author-X-Name-Last: Gao Author-Name: Ruey S. Tsay Author-X-Name-First: Ruey S. Author-X-Name-Last: Tsay Title: Modeling High-Dimensional Time Series: A Factor Model With Dynamically Dependent Factors and Diverging Eigenvalues Abstract: This article proposes a new approach to modeling high-dimensional time series by treating a p-dimensional time series as a nonsingular linear transformation of certain common factors and idiosyncratic components. Unlike the approximate factor models, we assume that the factors capture all the nontrivial dynamics of the data, but the cross-sectional dependence may be explained by both the factors and the idiosyncratic components. Under the proposed model, (a) the factor process is dynamically dependent and the idiosyncratic component is a white noise process, and (b) the largest eigenvalues of the covariance matrix of the idiosyncratic components may diverge to infinity as the dimension p increases. We propose a white noise testing procedure for high-dimensional time series to determine the number of white noise components and, hence, the number of common factors, and introduce a projected principal component analysis (PCA) to eliminate the diverging effect of the idiosyncratic noises.
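Illustrative aside: the white-noise counting step just described can be caricatured with ordinary PCA and a plain Ljung-Box test (the article's procedure uses a dedicated high-dimensional white-noise test and a projected PCA; dimensions and dynamics below are synthetic).

```python
# Count components that look like white noise: components driven by the
# AR(1) factors should fail the test, purely idiosyncratic ones should not.
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(7)
T, p, r = 500, 8, 2
f = np.zeros((T, r))
for t in range(1, T):                      # two AR(1) common factors
    f[t] = 0.8 * f[t - 1] + rng.normal(size=r)
load = rng.normal(size=(r, p))
x = f @ load + rng.normal(size=(T, p))     # factors plus white noise

xc = x - x.mean(0)
_, _, Vt = np.linalg.svd(xc, full_matrices=False)
comps = xc @ Vt.T                          # principal component series
for j in range(p):
    pval = acorr_ljungbox(comps[:, j], lags=[10])["lb_pvalue"].iloc[0]
    print(f"component {j}: white-noise p-value = {pval:.3f}")
```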
Asymptotic properties of the proposed method are established for both fixed p and diverging p as the sample size n increases to infinity. We use both simulated data and real examples to assess the performance of the proposed method. We also compare our method with two commonly used methods in the literature concerning the forecastability of the extracted factors and find that the proposed approach not only provides interpretable results, but also performs well in out-of-sample forecasting. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1398-1414 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1862668 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1862668 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1398-1414 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1863222_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Raiden B. Hasegawa Author-X-Name-First: Raiden B. Author-X-Name-Last: Hasegawa Author-Name: Dylan S. Small Author-X-Name-First: Dylan S. Author-X-Name-Last: Small Title: Estimating Malaria Vaccine Efficacy in the Absence of a Gold Standard Case Definition: Mendelian Factorial Design Abstract: Accurate estimates of malaria vaccine efficacy require a reliable definition of a malaria case. However, the symptoms of clinical malaria are nonspecific, overlapping with other childhood illnesses. Additionally, children in endemic areas tolerate varying levels of parasitemia without symptoms. Together, this makes finding a gold-standard case definition challenging. We present a method to identify and estimate malaria vaccine efficacy that does not require an observable gold-standard case definition. Instead, we leverage genetic traits that are protective against malaria but not against other illnesses, for example, the sickle cell trait, to identify vaccine efficacy in a randomized trial. Inspired by Mendelian randomization, we introduce Mendelian factorial design, a method that augments a randomized trial with genetic variation to produce a natural factorial experiment, which identifies vaccine efficacy under realistic assumptions. A robust, covariance-adjusted estimation procedure is developed for estimating vaccine efficacy on the risk ratio and incidence rate ratio scales. Simulations suggest that our estimator has good performance, whereas standard methods are systematically biased. We demonstrate that a combined estimator using both our proposed estimator and the standard approach yields significant improvements when the Mendelian factor is only weakly protective. Our method can be applied in vaccine and prevention trials of other childhood diseases that have no gold-standard case definition and known genetic risk factors. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1466-1481 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1863222 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1863222 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1466-1481 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2102501_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Guido Imbens Author-X-Name-First: Guido Author-X-Name-Last: Imbens Title: Comment on: “Confidence Intervals for Nonparametric Empirical Bayes Analysis” by Ignatiadis and Wager Journal: Journal of the American Statistical Association Pages: 1181-1182 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2022.2102501 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2102501 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1181-1182 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1864382_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Matteo Bonvini Author-X-Name-First: Matteo Author-X-Name-Last: Bonvini Author-Name: Edward H. Kennedy Author-X-Name-First: Edward H. Author-X-Name-Last: Kennedy Title: Sensitivity Analysis via the Proportion of Unmeasured Confounding Abstract: In observational studies, identification of average treatment effects (ATEs) is generally achieved by assuming that the correct set of confounders has been measured and properly included in the relevant models. Because this assumption is both strong and untestable, a sensitivity analysis should be performed. Common approaches include modeling the bias directly or varying the propensity scores to probe the effects of a potential unmeasured confounder. In this article, we take a novel approach whereby the sensitivity parameter is the “proportion of unmeasured confounding”: the proportion of units for whom the treatment is not as good as randomized even after conditioning on the observed covariates. We consider different assumptions on the probability of a unit being unconfounded. In each case, we derive sharp bounds on the average treatment effect as a function of the sensitivity parameter and propose nonparametric estimators that allow flexible covariate adjustment. We also introduce a one-number summary of a study’s robustness to the number of confounded units. Finally, we explore finite-sample properties via simulation, and apply the methods to an observational database used to assess the effects of right heart catheterization. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1540-1550 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1864382 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1864382 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
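Illustrative aside: the flavor of the sensitivity parameter in the Bonvini-Kennedy abstract can be conveyed with a deliberately crude bound. If a fraction eps of units may be arbitrarily confounded and the outcome is bounded in [0, 1], one can mix the identified effect with the worst case on the confounded fraction. This is a naive caricature for intuition only; the article derives sharp bounds and flexible nonparametric estimators, which this sketch does not reproduce.

```python
# Crude (non-sharp) ATE bounds under a "proportion eps of units is not
# as good as randomized" assumption, for outcomes in [0, 1].
import numpy as np

def naive_bounds(ate_identified: float, eps: float):
    # Worst case on the confounded fraction is an effect of -1 or +1.
    lo = (1 - eps) * ate_identified - eps
    hi = (1 - eps) * ate_identified + eps
    return lo, hi

for eps in (0.0, 0.05, 0.1, 0.2):
    print(eps, np.round(naive_bounds(0.15, eps), 3))
```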
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1540-1550 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2055559_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Yingtian Hu Author-X-Name-First: Yingtian Author-X-Name-Last: Hu Author-Name: Mahmoud Zeydabadinezhad Author-X-Name-First: Mahmoud Author-X-Name-Last: Zeydabadinezhad Author-Name: Longchuan Li Author-X-Name-First: Longchuan Author-X-Name-Last: Li Author-Name: Ying Guo Author-X-Name-First: Ying Author-X-Name-Last: Guo Title: A Multimodal Multilevel Neuroimaging Model for Investigating Brain Connectome Development Abstract: Recent advances in multimodal neuroimaging, such as functional MRI (fMRI) and diffusion MRI (dMRI), offer unprecedented opportunities to understand brain development. Most existing neurodevelopmental studies focus on using a single imaging modality to study microstructure or neural activations in localized brain regions. The developmental changes of brain network architecture in childhood and adolescence are not well understood. Our study made use of dMRI and resting-state fMRI imaging data sets from the Philadelphia Neurodevelopmental Cohort (PNC) study to characterize developmental changes in both structural and functional brain connectomes. A multimodal multilevel model (MMM) is developed and implemented in the PNC study to investigate brain maturation in both white matter structural connections and intrinsic functional connections. MMM addresses several major challenges in multimodal connectivity analysis. First, by using a first-level data generative model for observed measures and a second-level latent network model, MMM effectively infers underlying connection states from noisy imaging-based connectivity measurements. Second, MMM models the interplay between the structural and functional connections to capture the relationship between different brain connectomes. Third, MMM incorporates covariate effects in the network modeling to investigate network heterogeneity across subpopulations. Finally, by using a module-wise parameterization based on brain network topology, MMM is scalable to whole-brain connectomics. MMM analysis of the PNC study generates new insights into neurodevelopment during adolescence, including revealing that the majority of white-fiber connectivity growth is related to the cognitive networks, where the most significant increase is found between the default mode and the executive control networks, with a 15% increase in the probability of structural connections. We also find that functional connectome development derives mainly from global functional integration rather than from direct anatomical connections. To the best of our knowledge, these findings have not been reported in the literature using multimodal connectomics. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1134-1148 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2022.2055559 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2055559 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1134-1148 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1859380_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Laura Jula Vanegas Author-X-Name-First: Laura Author-X-Name-Last: Jula Vanegas Author-Name: Merle Behr Author-X-Name-First: Merle Author-X-Name-Last: Behr Author-Name: Axel Munk Author-X-Name-First: Axel Author-X-Name-Last: Munk Title: Multiscale Quantile Segmentation Abstract: We introduce a new methodology for analyzing serial data by quantile regression assuming that the underlying quantile function consists of constant segments. The procedure does not rely on any distributional assumption besides serial independence. It is based on a multiscale statistic, which makes it possible to control the (finite-sample) probability of selecting the correct number of segments S at a given error level, which serves as a tuning parameter. For a proper choice of this parameter, this probability tends to one exponentially fast as the sample size increases. We further show that the location and size of segments are estimated at the minimax optimal rate (compared to a Gaussian setting) up to a log factor. Thereby, our approach leads to (asymptotically) uniform confidence bands for the entire quantile regression function in a fully nonparametric setup. The procedure is efficiently implemented using dynamic programming techniques with double heap structures, and software is provided. Simulations and data examples from genetic sequencing and ion channel recordings confirm the robustness of the proposed procedure, which at the same time reliably detects changes in quantiles from arbitrary distributions with precise statistical guarantees. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1384-1397 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1859380 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1859380 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1384-1397 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2093728_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Peter Hoff Author-X-Name-First: Peter Author-X-Name-Last: Hoff Title: Coverage Properties of Empirical Bayes Intervals: A Discussion of “Confidence Intervals for Nonparametric Empirical Bayes Analysis” by Ignatiadis and Wager Journal: Journal of the American Statistical Association Pages: 1175-1178 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2022.2093728 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2093728 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
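Illustrative aside: the piecewise-constant quantile fitting behind the multiscale quantile segmentation abstract above can be caricatured with a single change point found by minimizing the quantile check loss over all split points. The multiscale statistic, the unknown number of segments, and the confidence bands are the article's contributions and are not reproduced here; data are synthetic.

```python
# One-change-point median segmentation via check-loss minimization.
import numpy as np

def check_loss(x, q, tau=0.5):
    r = x - q
    return np.sum(r * (tau - (r < 0)))

rng = np.random.default_rng(8)
x = np.r_[rng.normal(0, 1, 150), rng.normal(2, 1, 100)]

costs = [check_loss(x[:k], np.median(x[:k])) +
         check_loss(x[k:], np.median(x[k:]))
         for k in range(10, len(x) - 10)]
k_hat = int(np.argmin(costs)) + 10
print("estimated change point:", k_hat, "(truth: 150)")
```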
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1175-1178 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1863223_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Xiao Liu Author-X-Name-First: Xiao Author-X-Name-Last: Liu Author-Name: Kyongmin Yeo Author-X-Name-First: Kyongmin Author-X-Name-Last: Yeo Author-Name: Siyuan Lu Author-X-Name-First: Siyuan Author-X-Name-Last: Lu Title: Statistical Modeling for Spatio-Temporal Data From Stochastic Convection-Diffusion Processes Abstract: This article proposes a physical-statistical modeling approach for spatio-temporal data arising from a class of stochastic convection-diffusion processes. Such processes are widely found in scientific and engineering applications where fundamental physics imposes critical constraints on how data can be modeled and how models should be interpreted. The idea of spectrum decomposition is employed to approximate a physical spatio-temporal process by the linear combination of spatial basis functions and a multivariate random process of spectral coefficients. Unlike existing approaches assuming spatially and temporally invariant convection-diffusion, this article considers a more general scenario with spatially varying convection-diffusion and nonzero-mean source-sink. As a result, the temporal dynamics of the spectral coefficients are coupled with one another, which can be interpreted, from the perspective of physics, as nonlinear energy redistribution across multiple scales. Because of the spatially varying convection-diffusion, the space-time covariance is nonstationary in space. The theoretical results are integrated into a hierarchical dynamical spatio-temporal model. The connection is established between the proposed model and the existing models based on integro-difference equations. Computational efficiency and scalability are also investigated to make the proposed approach practical. The advantages of the proposed methodology are demonstrated by numerical examples, a case study, and comprehensive comparison studies. Computer code is available on GitHub. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1482-1499 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1863223 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1863223 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1482-1499 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1870984_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Zhengwu Zhang Author-X-Name-First: Zhengwu Author-X-Name-Last: Zhang Author-Name: Xiao Wang Author-X-Name-First: Xiao Author-X-Name-Last: Wang Author-Name: Linglong Kong Author-X-Name-First: Linglong Author-X-Name-Last: Kong Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Title: High-Dimensional Spatial Quantile Function-on-Scalar Regression Abstract: This article develops a novel spatial quantile function-on-scalar regression model, which studies the conditional spatial distribution of a high-dimensional functional response given scalar predictors. With the strength of both quantile regression and copula modeling, we are able to explicitly characterize the conditional distribution of the functional or image response on the whole spatial domain.
Our method provides a comprehensive understanding of the effect of scalar covariates on functional responses across different quantile levels and also gives a practical way to generate new images for given covariate values. Theoretically, we establish the minimax rates of convergence for estimating coefficient functions under both fixed and random designs. We further develop an efficient primal-dual algorithm to handle high-dimensional image data. Simulations and real data analysis are conducted to examine the finite-sample performance. Journal: Journal of the American Statistical Association Pages: 1563-1578 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1870984 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1870984 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1563-1578 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1863812_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Kristian Bjørn Hessellund Author-X-Name-First: Kristian Bjørn Author-X-Name-Last: Hessellund Author-Name: Ganggang Xu Author-X-Name-First: Ganggang Author-X-Name-Last: Xu Author-Name: Yongtao Guan Author-X-Name-First: Yongtao Author-X-Name-Last: Guan Author-Name: Rasmus Waagepetersen Author-X-Name-First: Rasmus Author-X-Name-Last: Waagepetersen Title: Semiparametric Multinomial Logistic Regression for Multivariate Point Pattern Data Abstract: We propose a new method for analysis of multivariate point pattern data observed in a heterogeneous environment and with complex intensity functions. We suggest semiparametric models for the intensity functions that depend on an unspecified factor common to all types of points. This is, for example, well suited for analyzing spatial covariate effects on events such as street crime activities that occur in a complex urban environment. A multinomial conditional composite likelihood function is introduced for estimation of intensity function regression parameters, and the asymptotic joint distribution of the resulting estimators is derived under mild conditions. Crucially, the asymptotic covariance matrix depends on ratios of cross pair correlation functions of the multivariate point process. To make valid statistical inference without restrictive assumptions, we construct consistent nonparametric estimators for these ratios. Finally, we construct standardized residual plots, predictive probability plots, and semiparametric intensity plots to validate and to visualize the findings of the model. The effectiveness of the proposed methodology is demonstrated through extensive simulation studies and an application to analyzing the effects of socio-economic and demographic variables on occurrences of street crimes in Washington, DC. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1500-1515 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1863812 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1863812 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
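Illustrative aside: the key identification trick in the Hessellund et al. abstract is that, conditional on a point occurring at a location, the probability that it is of type k is the intensity ratio rho_k(s) / sum_j rho_j(s), so the unspecified common spatial factor cancels and the regression parameters can be fit by a multinomial logistic regression on the observed points. A minimal synthetic sketch (not the authors' composite-likelihood machinery or variance estimators):

```python
# Fit type probabilities of points from covariates; an additive common
# factor in the log-intensities would cancel row-wise in the softmax.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(9)
n = 5000
cov_s = rng.normal(size=(n, 2))            # covariates at point locations
beta = np.array([[0.0, 0.0], [1.0, -0.5], [-0.8, 0.7]])  # type effects
lin = cov_s @ beta.T
prob = np.exp(lin) / np.exp(lin).sum(1, keepdims=True)
types = np.array([rng.choice(3, p=pr) for pr in prob])

fit = LogisticRegression(max_iter=1000).fit(cov_s, types)
print("estimated contrasts vs type 0:\n",
      np.round(fit.coef_ - fit.coef_[0], 2))   # compare with beta - beta[0]
```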
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1500-1515 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1853547_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Tengyuan Liang Author-X-Name-First: Tengyuan Author-X-Name-Last: Liang Author-Name: Hai Tran-Bach Author-X-Name-First: Hai Author-X-Name-Last: Tran-Bach Title: Mehler’s Formula, Branching Process, and Compositional Kernels of Deep Neural Networks Abstract: We use a connection between compositional kernels and branching processes via Mehler’s formula to study deep neural networks. This new probabilistic insight provides us with a novel perspective on the mathematical role of activation functions in compositional neural networks. We study the unscaled and rescaled limits of the compositional kernels and explore the different phases of the limiting behavior, as the compositional depth increases. We investigate the memorization capacity of the compositional kernels and neural networks by characterizing the interplay among compositional depth, sample size, dimensionality, and nonlinearity of the activation. Explicit formulas on the eigenvalues of the compositional kernel are provided, which quantify the complexity of the corresponding reproducing kernel Hilbert space. On the methodological front, we propose a new random features algorithm, which compresses the compositional layers by devising a new activation function. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1324-1337 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1853547 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1853547 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1324-1337 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2093725_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Bradley Efron Author-X-Name-First: Bradley Author-X-Name-Last: Efron Title: Discussion of “Confidence Intervals for Nonparametric Empirical Bayes Analysis” by Nikolaos Ignatiadis and Stefan Wager Journal: Journal of the American Statistical Association Pages: 1179-1180 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2022.2093725 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2093725 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1179-1180 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2096040_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Noel Cressie Author-X-Name-First: Noel Author-X-Name-Last: Cressie Title: Nonparametric Empirical Bayes Prediction Journal: Journal of the American Statistical Association Pages: 1167-1170 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2022.2096040 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2096040 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
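Illustrative aside: the depth limits studied in the Liang and Tran-Bach abstract above can be visualized by iterating the well-known "dual activation" recursion for normalized ReLU networks, k(rho) = (sqrt(1 - rho^2) + rho * (pi - arccos rho)) / pi, the arc-cosine kernel of Cho and Saul. This sketch only shows the contraction of correlations with depth; it is not the paper's branching-process analysis.

```python
# Iterate the ReLU compositional-kernel map and watch input correlations
# contract toward the fixed point rho = 1 as depth grows.
import numpy as np

def relu_dual(rho):
    rho = np.clip(rho, -1.0, 1.0)
    return (np.sqrt(1 - rho**2) + rho * (np.pi - np.arccos(rho))) / np.pi

rho = np.array([-0.9, -0.3, 0.0, 0.3, 0.9])
for depth in range(1, 9):
    rho = relu_dual(rho)
    if depth in (1, 2, 4, 8):
        print(f"depth {depth}:", np.round(rho, 4))
```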
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1167-1170 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2098135_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Sudipto Banerjee Author-X-Name-First: Sudipto Author-X-Name-Last: Banerjee Title: Discussion of “Measuring Housing Vitality from Multi-Source Big Data and Machine Learning” Journal: Journal of the American Statistical Association Pages: 1063-1065 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2022.2098135 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2098135 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1063-1065 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1847121_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Kolyan Ray Author-X-Name-First: Kolyan Author-X-Name-Last: Ray Author-Name: Botond Szabó Author-X-Name-First: Botond Author-X-Name-Last: Szabó Title: Variational Bayes for High-Dimensional Linear Regression With Sparse Priors Abstract: We study a mean-field spike and slab variational Bayes (VB) approximation to Bayesian model selection priors in sparse high-dimensional linear regression. Under compatibility conditions on the design matrix, oracle inequalities are derived for the mean-field VB approximation, implying that it converges to the sparse truth at the optimal rate and gives optimal prediction of the response vector. The empirical performance of our algorithm is studied, showing that it performs comparably to other state-of-the-art Bayesian variable selection methods. We also numerically demonstrate that the widely used coordinate-ascent variational inference algorithm can be highly sensitive to the parameter updating order, leading to potentially poor performance. To mitigate this, we propose a novel prioritized updating scheme that uses a data-driven updating order and performs better in simulations. The variational algorithm is implemented in the R package sparsevb. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1270-1281 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1847121 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1847121 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1270-1281 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1862671_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Fei Xue Author-X-Name-First: Fei Author-X-Name-Last: Xue Author-Name: Yanqing Zhang Author-X-Name-First: Yanqing Author-X-Name-Last: Zhang Author-Name: Wenzhuo Zhou Author-X-Name-First: Wenzhuo Author-X-Name-Last: Zhou Author-Name: Haoda Fu Author-X-Name-First: Haoda Author-X-Name-Last: Fu Author-Name: Annie Qu Author-X-Name-First: Annie Author-X-Name-Last: Qu Title: Multicategory Angle-Based Learning for Estimating Optimal Dynamic Treatment Regimes With Censored Data Abstract: An optimal dynamic treatment regime (DTR) consists of a sequence of decision rules that maximize long-term benefits, and is applicable to chronic diseases such as HIV infection or cancer.
In this article, we develop a novel angle-based approach to search for the optimal DTR under a multicategory treatment framework for survival data. The proposed method aims to maximize the conditional survival function of patients following a DTR. In contrast to most existing approaches, which are designed to maximize the expected survival time under a binary treatment framework, the proposed method solves the multicategory treatment problem given multiple stages for censored data. Specifically, the proposed method obtains the optimal DTR via integrating estimation of decision rules at multiple stages into a single multicategory classification algorithm without imposing additional constraints, which is also more computationally efficient and robust. In theory, we establish Fisher consistency and provide the risk bound for the proposed estimator under regularity conditions. Our numerical studies show that the proposed method outperforms competing methods in terms of maximizing the conditional survival probability. We apply the proposed method to two real datasets: Framingham heart study data and acquired immunodeficiency syndrome clinical data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1438-1451 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1862671 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1862671 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1438-1451 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1841646_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Faming Liang Author-X-Name-First: Faming Author-X-Name-Last: Liang Author-Name: Jingnan Xue Author-X-Name-First: Jingnan Author-X-Name-Last: Xue Author-Name: Bochao Jia Author-X-Name-First: Bochao Author-X-Name-Last: Jia Title: Markov Neighborhood Regression for High-Dimensional Inference Abstract: This article proposes an innovative method for constructing confidence intervals and assessing p-values in statistical inference for high-dimensional linear models. The proposed method has successfully broken the high-dimensional inference problem into a series of low-dimensional inference problems: For each regression coefficient βi, the confidence interval and p-value are computed by regressing on a subset of variables selected according to the conditional independence relations between the corresponding variable Xi and other variables. Since the subset of variables forms a Markov neighborhood of Xi in the Markov network formed by all the variables X1,X2,…,Xp, the proposed method is coined Markov neighborhood regression (MNR). The proposed method is tested on high-dimensional linear, logistic, and Cox regression. The numerical results indicate that the proposed method significantly outperforms the existing ones. Based on the MNR, a method of learning causal structures for high-dimensional linear models is proposed and applied to identification of drug-sensitive genes and cancer driver genes. The idea of using conditional independence relations for dimension reduction is general and potentially can be extended to other high-dimensional or big data problems as well. Supplementary materials for this article are available online.
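Illustrative aside: the "regress on a small conditioning set" idea behind Markov neighborhood regression can be caricatured in a few lines. Note the caveat: MNR selects the Markov neighborhood of Xi in the Markov network of the covariates, whereas the stand-in below simply uses a lasso on the outcome to pick a small conditioning set before a low-dimensional OLS; it illustrates the dimension-reduction-then-inference pattern only.

```python
# Low-dimensional confidence interval for one coefficient in a
# high-dimensional linear model, via a (crude) selected neighborhood.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(10)
n, p, i = 150, 300, 0
X = rng.normal(size=(n, p))
y = 1.0 * X[:, 0] + 0.8 * X[:, 1] - 0.6 * X[:, 2] + rng.normal(size=n)

others = np.delete(np.arange(p), i)
sel = LassoCV(cv=5).fit(X[:, others], y)          # stand-in selector
nbhd = others[np.abs(sel.coef_) > 1e-8][:10]      # small conditioning set

low_dim = sm.add_constant(X[:, np.r_[i, nbhd]])
fit = sm.OLS(y, low_dim).fit()
print("CI for beta_0:", np.round(fit.conf_int()[1], 3))  # row 1 = X_i
```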
Journal: Journal of the American Statistical Association Pages: 1200-1214 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1841646 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1841646 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1200-1214 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1875837_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Abdelaati Daouia Author-X-Name-First: Abdelaati Author-X-Name-Last: Daouia Author-Name: Irène Gijbels Author-X-Name-First: Irène Author-X-Name-Last: Gijbels Author-Name: Gilles Stupfler Author-X-Name-First: Gilles Author-X-Name-Last: Stupfler Title: Extremile Regression Abstract: Regression extremiles define a least squares analogue of regression quantiles. They are determined by weighted expectations rather than tail probabilities. Of special interest is their intuitive meaning in terms of expected minima and maxima. Their use appears naturally in risk management where, in contrast to quantiles, they fulfill the coherency axiom and take the severity of tail losses into account. In addition, they are comonotonically additive and belong to both the families of spectral risk measures and concave distortion risk measures. This article provides the first detailed study exploring implications of the extremile terminology in a general setting in the presence of covariates. We rely on local linear (least squares) check function minimization for estimating conditional extremiles and deriving the asymptotic normality of their estimators. We also extend extremile regression far into the tails of heavy-tailed distributions. Extrapolated estimators are constructed and their asymptotic theory is developed. Some applications to real data are provided. Journal: Journal of the American Statistical Association Pages: 1579-1586 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2021.1875837 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1875837 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1579-1586 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1864380_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Debmalya Nandy Author-X-Name-First: Debmalya Author-X-Name-Last: Nandy Author-Name: Francesca Chiaromonte Author-X-Name-First: Francesca Author-X-Name-Last: Chiaromonte Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Title: Covariate Information Number for Feature Screening in Ultrahigh-Dimensional Supervised Problems Abstract: Contemporary high-throughput experimental and surveying techniques give rise to ultrahigh-dimensional supervised problems with sparse signals; that is, a limited number of observations (n), each with a very large number of covariates (p≫n), only a small share of which is truly associated with the response. In these settings, major concerns on computational burden, algorithmic stability, and statistical accuracy call for substantially reducing the feature space by eliminating redundant covariates before the use of any sophisticated statistical analysis.
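Illustrative aside: a sample version of the extremile from the Daouia, Gijbels, and Stupfler abstract above can be sketched as a weighted average of order statistics. The distortion used below, K_tau(t) = t^{s(tau)} with s(tau) = log(1/2)/log(tau) for tau in [1/2, 1), is our recollection of the standard extremile definition and should be checked against the article; data are synthetic.

```python
# Sample extremile as a weighted mean of order statistics. At tau = 1/2
# the weights are uniform and the extremile reduces to the sample mean.
import numpy as np

def sample_extremile(x, tau):
    assert 0.5 <= tau < 1
    s = np.log(0.5) / np.log(tau)
    n = len(x)
    grid = np.arange(n + 1) / n
    w = grid[1:]**s - grid[:-1]**s      # weights summing to one
    return np.sort(x) @ w

rng = np.random.default_rng(11)
x = rng.lognormal(size=2000)
for tau in (0.5, 0.9, 0.99):
    print(tau, round(sample_extremile(x, tau), 3),
          "quantile:", round(np.quantile(x, tau), 3))
```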
Along the lines of Pearson’s correlation coefficient-based sure independence screening and other model- and correlation-based feature screening methods, we propose a model-free procedure called covariate information number-sure independence screening (CIS). CIS uses a marginal utility connected to the notion of the traditional Fisher information, possesses the sure screening property, and is applicable to any type of response (features) with continuous features (response). Simulations and an application to transcriptomic data on rats reveal the comparative strengths of CIS over some popular feature screening methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1516-1529 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1864380 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1864380 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1516-1529 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2096038_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Yang Zhou Author-X-Name-First: Yang Author-X-Name-Last: Zhou Author-Name: Lirong Xue Author-X-Name-First: Lirong Author-X-Name-Last: Xue Author-Name: Zhengyu Shi Author-X-Name-First: Zhengyu Author-X-Name-Last: Shi Author-Name: Libo Wu Author-X-Name-First: Libo Author-X-Name-Last: Wu Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Title: Measuring Housing Vitality from Multi-Source Big Data and Machine Learning Abstract: Timely, high-resolution measures of socioeconomic outcomes are critical for policymaking and evaluation, but hard to obtain reliably. With the help of machine learning and cheaply available data such as social media and nightlight, it is now possible to predict such indices in fine granularity. This article demonstrates an adaptive way to measure the time trend and spatial distribution of housing vitality (number of occupied houses) with the help of multiple easily accessible datasets: energy, nightlight, and land-use data. We first identified the high-frequency housing occupancy status from energy consumption data and then matched it with the monthly nightlight data. We then introduced the Factor-Augmented Regularized Model for prediction (FarmPredict) to deal with the dependence and collinearity issue among predictors by effectively lifting the prediction space, which is suitable for most machine learning algorithms. The heterogeneity issue in big data analysis is mitigated through the land-use data. FarmPredict allows us to extend the regional results to the city level, with a 76% out-of-sample explanation of the spatial and temporal variation in house usage. Since energy is indispensable for life, our method is highly transferable, with the only requirement being publicly accessible data. Our article provides an alternative approach with statistical machine learning to predict socioeconomic outcomes without relying on existing census and survey data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1045-1059 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2022.2096038 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2096038 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
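Illustrative aside: the "lifting the prediction space" step in the FarmPredict description above can be caricatured by extracting PCA factors from the predictors and feeding the factors together with the idiosyncratic residuals into a regularized regression, which decorrelates the design before the machine-learning step. All data and tuning choices below are synthetic, and the published procedure differs in its details.

```python
# Factor-augmented regularized prediction, caricature version.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(12)
n, p, r = 400, 100, 3
F = rng.normal(size=(n, r))                   # latent factors
B = rng.normal(size=(r, p))
X = F @ B + rng.normal(size=(n, p))           # highly correlated design
y = F[:, 0] + X[:, 0] - X[:, 1] + rng.normal(size=n)

pca = PCA(n_components=r).fit(X)
F_hat = pca.transform(X)                      # estimated factors
U_hat = X - pca.inverse_transform(F_hat)      # idiosyncratic residuals

Z = np.c_[F_hat, U_hat]                       # lifted, decorrelated design
fit = LassoCV(cv=5).fit(Z, y)
print("in-sample R^2:", round(fit.score(Z, y), 3))
```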
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1045-1059 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2097086_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Wei Tu Author-X-Name-First: Wei Author-X-Name-Last: Tu Author-Name: Bei Jiang Author-X-Name-First: Bei Author-X-Name-Last: Jiang Author-Name: Linglong Kong Author-X-Name-First: Linglong Author-X-Name-Last: Kong Title: Comments on “Measuring Housing Vitality from Multi-Source Big Data and Machine Learning” Journal: Journal of the American Statistical Association Pages: 1060-1062 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2022.2097086 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2097086 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1060-1062 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2104726_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Nicholas J. Horton Author-X-Name-First: Nicholas J. Author-X-Name-Last: Horton Title: Foundations of Statistics for Data Scientists: With R and Python Journal: Journal of the American Statistical Association Pages: 1603-1604 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2022.2104726 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2104726 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1603-1604 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2041422_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Tianwen Ma Author-X-Name-First: Tianwen Author-X-Name-Last: Ma Author-Name: Yang Li Author-X-Name-First: Yang Author-X-Name-Last: Li Author-Name: Jane E. Huggins Author-X-Name-First: Jane E. Author-X-Name-Last: Huggins Author-Name: Ji Zhu Author-X-Name-First: Ji Author-X-Name-Last: Zhu Author-Name: Jian Kang Author-X-Name-First: Jian Author-X-Name-Last: Kang Title: Bayesian Inferences on Neural Activity in EEG-Based Brain-Computer Interface Abstract: A brain-computer interface (BCI) is a system that translates brain activity into commands to operate technology. A common design for an electroencephalogram (EEG) BCI relies on the classification of the P300 event-related potential (ERP), which is a response elicited by the rare occurrence of target stimuli among common nontarget stimuli. Few existing ERP classifiers directly explore the underlying mechanism of the neural activity. To this end, we perform a novel Bayesian analysis of the probability distribution of multi-channel real EEG signals under the P300 ERP-BCI design. We aim to identify relevant spatial-temporal differences of the neural activity, which provides statistical evidence of P300 ERP responses and helps design individually efficient and accurate BCIs. As one key finding of our single-participant analysis, there is a 90% posterior probability that the target ERPs of the channels around the visual cortex reach their negative peaks around 200 milliseconds poststimulus. Our analysis identifies five important channels (PO7, PO8, Oz, P4, Cz) for the BCI speller, leading to 100% prediction accuracy.
From the analyses of nine other participants, we consistently select the identified five channels, and the selection frequencies are robust to small variations of bandpass filters and kernel hyperparameters. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1122-1133 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2022.2041422 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2041422 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1122-1133 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1844211_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Ting Li Author-X-Name-First: Ting Author-X-Name-Last: Li Author-Name: Tengfei Li Author-X-Name-First: Tengfei Author-X-Name-Last: Li Author-Name: Zhongyi Zhu Author-X-Name-First: Zhongyi Author-X-Name-Last: Zhu Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Title: Regression Analysis of Asynchronous Longitudinal Functional and Scalar Data Abstract: Many modern large-scale longitudinal neuroimaging studies, such as the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study, have collected/are collecting asynchronous scalar and functional variables that are measured at distinct time points. The analyses of temporally asynchronous functional and scalar variables pose major technical challenges to many existing statistical approaches. We propose a class of generalized functional partial-linear varying-coefficient models to appropriately deal with these challenges through introducing both scalar and functional coefficients of interest and using kernel weighting methods. We design penalized kernel-weighted estimating equations to estimate scalar and functional coefficients, in which we represent functional coefficients by using a rich truncated tensor product penalized B-spline basis. We establish the theoretical properties of scalar and functional coefficient estimators including consistency, convergence rate, prediction accuracy, and limiting distributions. We also propose a bootstrap method to test the nullity of both parametric and functional coefficients, while establishing the bootstrap consistency. Simulation studies and the analysis of the ADNI study are used to assess the finite sample performance of our proposed approach. Our real data analysis reveals a significant relationship between fractional anisotropy density curves and cognitive function, with education, baseline disease status, and the APOE4 gene as major contributing factors. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1228-1242 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1844211 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1844211 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1228-1242 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1850461_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Qinglong Tian Author-X-Name-First: Qinglong Author-X-Name-Last: Tian Author-Name: Fanqi Meng Author-X-Name-First: Fanqi Author-X-Name-Last: Meng Author-Name: Daniel J. Nordman Author-X-Name-First: Daniel J.
Author-X-Name-Last: Nordman Author-Name: William Q. Meeker Author-X-Name-First: William Q. Author-X-Name-Last: Meeker Title: Predicting the Number of Future Events Abstract: This article describes prediction methods for the number of future events from a population of units associated with an on-going time-to-event process. Examples include the prediction of warranty returns and the prediction of the number of future product failures that could cause serious threats to property or life. Important decisions such as whether a product recall should be mandated are often based on such predictions. Data, generally right-censored (and sometimes left truncated and right-censored), are used to estimate the parameters of a time-to-event distribution. This distribution can then be used to predict the number of events over future periods of time. Such predictions are sometimes called within-sample predictions and differ from other prediction problems considered in most of the prediction literature. This article shows that the plug-in (also known as estimative or naive) prediction method is not asymptotically correct (i.e., for large amounts of data, the coverage probability always fails to converge to the nominal confidence level). However, a commonly used prediction calibration method is shown to be asymptotically correct for within-sample predictions, and two alternative predictive-distribution-based methods that perform better than the calibration method are presented and justified. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1296-1310 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1850461 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1850461 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1296-1310 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1863221_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Timothy W. Waite Author-X-Name-First: Timothy W. Author-X-Name-Last: Waite Author-Name: David C. Woods Author-X-Name-First: David C. Author-X-Name-Last: Woods Title: Minimax Efficient Random Experimental Design Strategies With Application to Model-Robust Design for Prediction Abstract: In game theory and statistical decision theory, a random (i.e., mixed) decision strategy often outperforms a deterministic strategy in minimax expected loss. As experimental design can be viewed as a game pitting the Statistician against Nature, the use of a random strategy to choose a design will often be beneficial. However, the topic of minimax-efficient random strategies for design selection is mostly unexplored, with consideration limited to Fisherian randomization of the allocation of a predetermined set of treatments to experimental units. Here, for the first time, novel and more flexible random design strategies are shown to have better properties than their deterministic counterparts in linear model estimation and prediction, including stronger bounds on both the expectation and survivor function of the loss distribution. Design strategies are considered for three important statistical problems: (i) parameter estimation in linear potential outcomes models, (ii) point prediction from a correct linear model, and (iii) global prediction from a linear model taking into account an L2-class of possible model discrepancy functions. 
The new random design strategies proposed for (iii) give a finite bound on the expected loss, a dramatic improvement compared to existing deterministic exact designs for which the expected loss is unbounded. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1452-1465 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1863221 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1863221 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1452-1465 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2027774_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Nathan B. Wikle Author-X-Name-First: Nathan B. Author-X-Name-Last: Wikle Author-Name: Ephraim M. Hanks Author-X-Name-First: Ephraim M. Author-X-Name-Last: Hanks Author-Name: Lucas R. F. Henneman Author-X-Name-First: Lucas R. F. Author-X-Name-Last: Henneman Author-Name: Corwin M. Zigler Author-X-Name-First: Corwin M. Author-X-Name-Last: Zigler Title: A Mechanistic Model of Annual Sulfate Concentrations in the United States Abstract: Understanding how individual pollution sources contribute to ambient sulfate pollution is critical for assessing past and future air quality regulations. Since attribution to specific sources is typically not encoded in spatial air pollution data, we develop a mechanistic model which we use to estimate, with uncertainty, the contribution of ambient sulfate concentrations attributable specifically to sulfur dioxide (SO2) emissions from individual coal-fired power plants in the central United States. We propose a multivariate Ornstein–Uhlenbeck (OU) process approximation to the dynamics of the underlying space-time chemical transport process, and its distributional properties are leveraged to specify novel probability models for spatial data that are viewed as either a snapshot or time-averaged observation of the OU process. Using US EPA SO2 emissions data from 193 power plants and state-of-the-art estimates of ground-level annual mean sulfate concentrations, we estimate that in 2011—a time of active power plant regulatory action—existing flue-gas desulfurization (FGD) technologies at 66 power plants reduced population-weighted exposure to ambient sulfate by 1.97 μg/m3 (95% CI: 1.80–2.15). Furthermore, we anticipate future regulatory benefits by estimating that installing FGD technologies at the five largest SO2-emitting facilities would reduce human exposure to ambient sulfate by an additional 0.45 μg/m3 (95% CI: 0.33–0.54). Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1082-1093 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2022.2027774 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2027774 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1082-1093 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1864381_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Alan Riva-Palacio Author-X-Name-First: Alan Author-X-Name-Last: Riva-Palacio Author-Name: Fabrizio Leisen Author-X-Name-First: Fabrizio Author-X-Name-Last: Leisen Author-Name: Jim Griffin Author-X-Name-First: Jim Author-X-Name-Last: Griffin Title: Survival Regression Models With Dependent Bayesian Nonparametric Priors Abstract: We present a novel Bayesian nonparametric model for regression in survival analysis. Our model builds on the classical neutral to the right model of Doksum and on the Cox proportional hazards model of Kim and Lee. The use of a vector of dependent Bayesian nonparametric priors allows us to efficiently model the hazard as a function of covariates while allowing nonproportionality. The model can be seen as having competing latent risks. We characterize the posterior of the underlying dependent vector of completely random measures and study the asymptotic behavior of the model. We show how an MCMC scheme can provide Bayesian inference for posterior means and credible intervals. The method is illustrated using simulated and real data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1530-1539 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1864381 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1864381 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1530-1539 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2024436_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Santiago Olivella Author-X-Name-First: Santiago Author-X-Name-Last: Olivella Author-Name: Tyler Pratt Author-X-Name-First: Tyler Author-X-Name-Last: Pratt Author-Name: Kosuke Imai Author-X-Name-First: Kosuke Author-X-Name-Last: Imai Title: Dynamic Stochastic Blockmodel Regression for Network Data: Application to International Militarized Conflicts Abstract: The decision to engage in military conflict is shaped by many factors, including state- and dyad-level characteristics as well as the state’s membership in geopolitical coalitions. Supporters of the democratic peace theory, for example, hypothesize that the community of democratic states is less likely to wage war with each other. Such theories explain the ways in which nodal and dyadic characteristics affect the evolution of conflict patterns over time via their effects on group memberships. To test these arguments, we develop a dynamic model of network data by combining a hidden Markov model with a mixed-membership stochastic blockmodel that identifies latent groups underlying the network structure. Unlike existing models, we incorporate covariates that predict dynamic node memberships in latent groups as well as the direct formation of edges between dyads. While prior substantive research often assumes the decision to engage in international militarized conflict is independent across states and static over time, we demonstrate that conflict is driven by states’ evolving membership in geopolitical blocs. 
Our analysis of militarized disputes from 1816 to 2010 identifies two distinct blocs of democratic states, only one of which exhibits unusually low rates of conflict. Changes in monadic covariates like democracy shift states between coalitions, making some states more pacific but others more belligerent. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1068-1081 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2021.2024436 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2024436 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1068-1081 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1862669_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Daniel Malinsky Author-X-Name-First: Daniel Author-X-Name-Last: Malinsky Author-Name: Ilya Shpitser Author-X-Name-First: Ilya Author-X-Name-Last: Shpitser Author-Name: Eric J. Tchetgen Tchetgen Author-X-Name-First: Eric J. Author-X-Name-Last: Tchetgen Tchetgen Title: Semiparametric Inference for Nonmonotone Missing-Not-at-Random Data: The No Self-Censoring Model Abstract: We study the identification and estimation of statistical functionals of multivariate data missing nonmonotonically and not-at-random, taking a semiparametric approach. Specifically, we assume that the missingness mechanism satisfies what has been previously called “no self-censoring” or “itemwise conditionally independent nonresponse,” which roughly corresponds to the assumption that no partially observed variable directly determines its own missingness status. We show that this assumption, combined with an odds ratio parameterization of the joint density, enables identification of functionals of interest, and we establish the semiparametric efficiency bound for the nonparametric model satisfying this assumption. We propose a practical augmented inverse probability weighted estimator, and in the setting with a (possibly high-dimensional) always-observed subset of covariates, our proposed estimator enjoys a certain double-robustness property. We explore the performance of our estimator with simulation experiments and on a previously studied dataset of HIV-positive mothers in Botswana. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1415-1423 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1862669 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1862669 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1415-1423 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1850460_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Yiyuan She Author-X-Name-First: Yiyuan Author-X-Name-Last: She Author-Name: Zhifeng Wang Author-X-Name-First: Zhifeng Author-X-Name-Last: Wang Author-Name: Jiahui Shen Author-X-Name-First: Jiahui Author-X-Name-Last: Shen Title: Gaining Outlier Resistance With Progressive Quantiles: Fast Algorithms and Theoretical Studies Abstract: Outliers widely occur in big-data applications and may severely affect statistical estimation and inference. 
In this article, a framework of outlier-resistant estimation is introduced to robustify an arbitrarily given loss function. It has a close connection to the method of trimming and includes explicit outlyingness parameters for all samples, which in turn facilitates computation, theory, and parameter tuning. To tackle the issues of nonconvexity and nonsmoothness, we develop scalable algorithms with implementation ease and guaranteed fast convergence. In particular, a new technique is proposed to alleviate the requirement on the starting point such that on regular datasets, the number of data resamplings can be substantially reduced. Based on combined statistical and computational treatments, we are able to perform nonasymptotic analysis beyond M-estimation. The obtained resistant estimators, though not necessarily globally or even locally optimal, enjoy minimax rate optimality in both low dimensions and high dimensions. Experiments in regression, classification, and neural networks show excellent performance of the proposed methodology at the occurrence of gross outliers. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1282-1295 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1850460 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1850460 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1282-1295 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1841647_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Brenda Betancourt Author-X-Name-First: Brenda Author-X-Name-Last: Betancourt Author-Name: Giacomo Zanella Author-X-Name-First: Giacomo Author-X-Name-Last: Zanella Author-Name: Rebecca C. Steorts Author-X-Name-First: Rebecca C. Author-X-Name-Last: Steorts Title: Random Partition Models for Microclustering Tasks Abstract: Traditional Bayesian random partition models assume that the size of each cluster grows linearly with the number of data points. While this is appealing for some applications, this assumption is not appropriate for other tasks such as entity resolution (ER), modeling of sparse networks, and DNA sequencing tasks. Such applications require models that yield clusters whose sizes grow sublinearly with the total number of data points—the microclustering property. Motivated by these issues, we propose a general class of random partition models that satisfy the microclustering property with well-characterized theoretical properties. Our proposed models overcome major limitations in the existing literature on microclustering models, namely a lack of interpretability, identifiability, and full characterization of model asymptotic properties. Crucially, we drop the classical assumption of having an exchangeable sequence of data points, and instead assume an exchangeable sequence of clusters. In addition, our framework provides flexibility in terms of the prior distribution of cluster sizes, computational tractability, and applicability to a large number of microclustering tasks. We establish theoretical properties of the resulting class of priors, where we characterize the asymptotic behavior of the number of clusters and of the proportion of clusters of a given size. Our framework allows a simple and efficient Markov chain Monte Carlo algorithm to perform statistical inference. 
We illustrate our proposed methodology on the microclustering task of ER, where we provide a simulation study and real experiments on survey panel data. Journal: Journal of the American Statistical Association Pages: 1215-1227 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1841647 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1841647 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1215-1227 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2093729_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Nikolaos Ignatiadis Author-X-Name-First: Nikolaos Author-X-Name-Last: Ignatiadis Author-Name: Stefan Wager Author-X-Name-First: Stefan Author-X-Name-Last: Wager Title: Rejoinder: Confidence Intervals for Nonparametric Empirical Bayes Analysis Journal: Journal of the American Statistical Association Pages: 1192-1199 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2022.2093729 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2093729 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1192-1199 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1844720_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f Author-Name: Peter Hoff Author-X-Name-First: Peter Author-X-Name-Last: Hoff Title: Smaller p-Values via Indirect Information Abstract: This article develops p-values for evaluating means of normal populations that make use of indirect or prior information. A p-value of this type is based on a biased frequentist hypothesis test that has optimal average power with respect to a probability distribution that encodes indirect information about the mean parameter, resulting in a smaller p-value if the indirect information is accurate. In a variety of multiparameter settings, we show how to adaptively estimate the indirect information for each mean parameter while still maintaining uniformity of the p-values under their null hypotheses. This is done using a linking model through which indirect information about the mean of one population may be obtained from the data of other populations. Importantly, the linking model does not need to be correct to maintain the uniformity of the p-values under their null hypotheses. This methodology is illustrated in several data analysis scenarios, including small area inference, spatially arranged populations, interactions in linear regression, and generalized linear models. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1254-1269 Issue: 539 Volume: 117 Year: 2022 Month: 9 X-DOI: 10.1080/01621459.2020.1844720 File-URL: http://hdl.handle.net/10.1080/01621459.2020.1844720 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1254-1269 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1904959_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Qing Mai Author-X-Name-First: Qing Author-X-Name-Last: Mai Author-Name: Xin Zhang Author-X-Name-First: Xin Author-X-Name-Last: Zhang Author-Name: Yuqing Pan Author-X-Name-First: Yuqing Author-X-Name-Last: Pan Author-Name: Kai Deng Author-X-Name-First: Kai Author-X-Name-Last: Deng Title: A Doubly Enhanced EM Algorithm for Model-Based Tensor Clustering Abstract: Modern scientific studies often collect datasets in the form of tensors. These datasets call for innovative statistical analysis methods. In particular, there is a pressing need for tensor clustering methods to understand the heterogeneity in the data. We propose a tensor normal mixture model approach to enable probabilistic interpretation and computational tractability. Our statistical model leverages the tensor covariance structure to reduce the number of parameters for parsimonious modeling, and at the same time explicitly exploits the correlations for better variable selection and clustering. We propose a doubly enhanced expectation–maximization (DEEM) algorithm to perform clustering under this model. Both the expectation-step and the maximization-step are carefully tailored for tensor data in order to maximize statistical accuracy and minimize computational costs in high dimensions. Theoretical studies confirm that DEEM achieves consistent clustering even when the dimension of each mode of the tensors grows exponentially with the sample size. Numerical studies demonstrate favorable performance of DEEM in comparison to existing methods. Journal: Journal of the American Statistical Association Pages: 2120-2134 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1904959 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1904959 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2120-2134 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1876710_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Minsuk Shin Author-X-Name-First: Minsuk Author-X-Name-Last: Shin Author-Name: Jun S. Liu Author-X-Name-First: Jun S. Author-X-Name-Last: Liu Title: Neuronized Priors for Bayesian Sparse Linear Regression Abstract: Although Bayesian variable selection methods have been intensively studied, their routine use in practice has not caught up with their non-Bayesian counterparts such as Lasso, likely due to difficulties in both computation and the flexibility of prior choices. To ease these challenges, we propose the neuronized priors to unify and extend some popular shrinkage priors, such as Laplace, Cauchy, horseshoe, and spike-and-slab priors. A neuronized prior can be written as the product of a Gaussian weight variable and a scale variable transformed from Gaussian via an activation function. Compared with classic spike-and-slab priors, the neuronized priors achieve the same explicit variable selection without employing any latent indicator variables, which results in both more efficient and flexible posterior sampling and more effective posterior modal estimation.
Theoretically, we provide specific conditions on the neuronized formulation to achieve the optimal posterior contraction rate, and show that a broadly applicable MCMC algorithm achieves an exponentially fast convergence rate under the neuronized formulation. We also examine various simulated and real data examples and demonstrate that using the neuronized representation is computationally more efficient than, or comparable to, its standard counterpart in all well-known cases. An R package NPrior is provided for using neuronized priors in Bayesian linear regression. Journal: Journal of the American Statistical Association Pages: 1695-1710 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1876710 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1876710 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1695-1710 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1891927_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Lucio Barabesi Author-X-Name-First: Lucio Author-X-Name-Last: Barabesi Author-Name: Andrea Cerasa Author-X-Name-First: Andrea Author-X-Name-Last: Cerasa Author-Name: Andrea Cerioli Author-X-Name-First: Andrea Author-X-Name-Last: Cerioli Author-Name: Domenico Perrotta Author-X-Name-First: Domenico Author-X-Name-Last: Perrotta Title: On Characterizations and Tests of Benford’s Law Abstract: Benford’s law defines a probability distribution for patterns of significant digits in real numbers. When the law is expected to hold for genuine observations, deviation from it can be taken as evidence of possible data manipulation. We derive results on a transform of the significand function that provide motivation for new tests of conformance to Benford’s law exploiting its sum-invariance characterization. We also study the connection between sum invariance of the first digit and the corresponding marginal probability distribution. We approximate the exact distribution of the new test statistics through a computationally efficient Monte Carlo algorithm. We investigate the power of our tests under different alternatives and we point out relevant situations in which they are clearly preferable to the available procedures. Finally, we show the application potential of our approach in the context of fraud detection in international trade. Journal: Journal of the American Statistical Association Pages: 1887-1903 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1891927 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1891927 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1887-1903 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2077209_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Le Bao Author-X-Name-First: Le Author-X-Name-Last: Bao Author-Name: Changcheng Li Author-X-Name-First: Changcheng Author-X-Name-Last: Li Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Author-Name: Songshan Yang Author-X-Name-First: Songshan Author-X-Name-Last: Yang Title: Causal Structural Learning on MPHIA Individual Dataset Abstract: The Population-based HIV Impact Assessment (PHIA) is an ongoing project that conducts nationally representative HIV-focused surveys for measuring national and regional progress toward UNAIDS’ 90-90-90 targets, the primary strategy to end the HIV epidemic. We believe the PHIA survey offers a unique opportunity to better understand the key factors that drive the HIV epidemics in the most affected countries in sub-Saharan Africa. In this article, we propose a novel causal structural learning algorithm to discover important covariates and potential causal pathways for 90-90-90 targets. Existing constraint-based causal structural learning algorithms are quite aggressive in edge removal. The proposed algorithm preserves more information about important features and potential causal pathways. It is applied to the Malawi PHIA (MPHIA) dataset and leads to interesting results. For example, it discovers age and condom usage to be important for female HIV awareness; the number of sexual partners to be important for male HIV awareness; and knowing the travel time to HIV care facilities leads to a higher chance of being treated for both females and males. We further compare and validate the proposed algorithm using BIC and using Monte Carlo simulations, and show that the proposed algorithm achieves improvement in true positive rates in important feature discovery over existing algorithms. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1642-1655 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2022.2077209 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2077209 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1642-1655 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1902817_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Dungang Liu Author-X-Name-First: Dungang Author-X-Name-Last: Liu Author-Name: Regina Y. Liu Author-X-Name-First: Regina Y. Author-X-Name-Last: Liu Author-Name: Min-ge Xie Author-X-Name-First: Min-ge Author-X-Name-Last: Xie Title: Nonparametric Fusion Learning for Multiparameters: Synthesize Inferences From Diverse Sources Using Data Depth and Confidence Distribution Abstract: Fusion learning refers to synthesizing inferences from multiple sources or studies to make a more effective inference and prediction than from any individual source or study alone. Most existing methods for synthesizing inferences rely on parametric model assumptions, such as normality, which often do not hold in practice. We propose a general nonparametric fusion learning framework for synthesizing inferences for multiparameters from different studies. 
The main tool underlying the proposed framework is the new notion of depth confidence distribution (depth-CD), which is developed by combining data depth and confidence distribution. Broadly speaking, a depth-CD is a data-driven nonparametric summary distribution of the available inferential information for a target parameter. We show that a depth-CD is a powerful inferential tool and, moreover, is an omnibus form of confidence regions, whose contours of level sets shrink toward the true parameter value. The proposed fusion learning approach combines depth-CDs from the individual studies, with each depth-CD constructed by nonparametric bootstrap and data depth. The approach is shown to be efficient, general, and robust. Specifically, it achieves high-order accuracy and Bahadur efficiency under suitably chosen combining elements. It allows the model or inference structure to differ among individual studies, and it readily adapts to heterogeneous studies with a broad range of complex and irregular settings. This last property enables the approach to use indirect evidence from incomplete studies to gain efficiency for the overall inference. We develop the theoretical support for the proposed approach, and we also illustrate the approach in making combined inference for the common mean vector and correlation coefficient from several studies. The numerical results from simulated studies show the approach to be less biased and more efficient than the traditional approaches in nonnormal settings. The advantages of the approach are also demonstrated in a Federal Aviation Administration study of aircraft landing performance. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2086-2104 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1902817 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1902817 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2086-2104 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1909598_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Gaetano Romano Author-X-Name-First: Gaetano Author-X-Name-Last: Romano Author-Name: Guillem Rigaill Author-X-Name-First: Guillem Author-X-Name-Last: Rigaill Author-Name: Vincent Runge Author-X-Name-First: Vincent Author-X-Name-Last: Runge Author-Name: Paul Fearnhead Author-X-Name-First: Paul Author-X-Name-Last: Fearnhead Title: Detecting Abrupt Changes in the Presence of Local Fluctuations and Autocorrelated Noise Abstract: While there are a plethora of algorithms for detecting changes in mean in univariate time-series, almost all struggle in real applications where there is autocorrelated noise or where the mean fluctuates locally between the abrupt changes that one wishes to detect. In these cases, default implementations, which are often based on assumptions of a constant mean between changes and independent noise, can lead to substantial over-estimation of the number of changes. We propose a principled approach to detect such abrupt changes that models local fluctuations as a random walk process and autocorrelated noise via an AR(1) process. We then estimate the number and location of changepoints by minimizing a penalized cost based on this model.
We develop a novel and efficient dynamic programming algorithm, DeCAFS, that can solve this minimization problem, despite the additional challenge of dependence across segments induced by the autocorrelated noise, which makes existing algorithms inapplicable. Theory and empirical results show that our approach has greater power at detecting abrupt changes than existing approaches. We apply our method to measuring gene expression levels in bacteria. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2147-2162 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1909598 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1909598 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2147-2162 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1895810_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Jiayin Zheng Author-X-Name-First: Jiayin Author-X-Name-Last: Zheng Author-Name: Yingye Zheng Author-X-Name-First: Yingye Author-X-Name-Last: Zheng Author-Name: Li Hsu Author-X-Name-First: Li Author-X-Name-Last: Hsu Title: Risk Projection for Time-to-Event Outcome Leveraging Summary Statistics With Source Individual-Level Data Abstract: Predicting risks of chronic diseases has become increasingly important in clinical practice. When a prediction model is developed in a cohort, there is great interest in applying the model to other cohorts. Due to potential discrepancies in baseline disease incidence between different cohorts and shifts in patient composition, the risk predicted by the model built in the source cohort often under- or over-estimates the risk in a new cohort. In this article, we assume the relative risks of predictors are the same between the two cohorts, and propose a novel weighted estimating equation approach to recalibrating the projected risk for the targeted population through updating the baseline risk. The recalibration leverages the knowledge about survival probabilities for the disease of interest and competing events, and summary information of risk factors from the target population. We establish the consistency and asymptotic normality of the proposed estimators. Extensive simulations demonstrate that the proposed estimators are robust, even if the risk factor distributions differ between the source and target populations, and gain efficiency if they are the same, as long as the information from the target is precise. The method is illustrated with a recalibration of a colorectal cancer prediction model. Journal: Journal of the American Statistical Association Pages: 2043-2055 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1895810 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1895810 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2043-2055 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1906685_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Peter Z. Schochet Author-X-Name-First: Peter Z. Author-X-Name-Last: Schochet Author-Name: Nicole E. Pashley Author-X-Name-First: Nicole E. Author-X-Name-Last: Pashley Author-Name: Luke W. Miratrix Author-X-Name-First: Luke W.
Author-X-Name-Last: Miratrix Author-Name: Tim Kautz Author-X-Name-First: Tim Author-X-Name-Last: Kautz Title: Design-Based Ratio Estimators and Central Limit Theorems for Clustered, Blocked RCTs Abstract: This article develops design-based ratio estimators for clustered, blocked randomized controlled trials (RCTs), with an application to a federally funded, school-based RCT testing the effects of behavioral health interventions. We consider finite population weighted least-squares estimators for average treatment effects (ATEs), allowing for general weighting schemes and covariates. We consider models with block-by-treatment status interactions as well as restricted models with block indicators only. We prove new finite population central limit theorems for each block specification. We also discuss simple variance estimators that share features with commonly used cluster-robust standard error estimators. Simulations show that the design-based ATE estimator yields nominal rejection rates with standard errors near true ones, even with few clusters. Journal: Journal of the American Statistical Association Pages: 2135-2146 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1906685 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1906685 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2135-2146 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1882466_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Xiaowu Dai Author-X-Name-First: Xiaowu Author-X-Name-Last: Dai Author-Name: Lexin Li Author-X-Name-First: Lexin Author-X-Name-Last: Li Title: Kernel Ordinary Differential Equations Abstract: Ordinary differential equations (ODEs) are widely used in modeling biological and physical processes in science. In this article, we propose a new reproducing kernel-based approach for estimation and inference of ODEs given noisy observations. We do not assume the functional forms in the ODE to be known, or restrict them to be linear or additive, and we allow pairwise interactions. We perform sparse estimation to select individual functionals, and construct confidence intervals for the estimated signal trajectories. We establish the estimation optimality and selection consistency of kernel ODE under both the low-dimensional and high-dimensional settings, where the number of unknown functionals can be smaller or larger than the sample size. Our proposal builds upon the smoothing spline analysis of variance (SS-ANOVA) framework, but tackles several important problems that are not yet fully addressed, and thus extends the scope of existing SS-ANOVA as well. We demonstrate the efficacy of our method through numerous ODE examples. Journal: Journal of the American Statistical Association Pages: 1711-1725 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1882466 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1882466 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1711-1725 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1915319_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Yingying Zhang Author-X-Name-First: Yingying Author-X-Name-Last: Zhang Author-Name: Huixia Judy Wang Author-X-Name-First: Huixia Judy Author-X-Name-Last: Wang Author-Name: Zhongyi Zhu Author-X-Name-First: Zhongyi Author-X-Name-Last: Zhu Title: Single-index Thresholding in Quantile Regression Abstract: Threshold regression models are useful for identifying subgroups with heterogeneous parameters. The conventional threshold regression models split the sample based on a single and observed threshold variable, which enforces the threshold point to be equal for all subgroups of the population. In this article, we consider a more flexible single-index threshold model in the quantile regression setup, in which the sample is split based on a linear combination of predictors. We propose a new estimator by smoothing the indicator function in thresholding, which enables Gaussian approximation for statistical inference and allows us to characterize the limiting distribution when the quantile process is of interest. We further construct a mixed-bootstrap inference method with faster computation and a procedure for testing the constancy of the threshold parameters across quantiles. Finally, we demonstrate the value of the proposed methods via simulation studies, as well as through an application to executive compensation data. Journal: Journal of the American Statistical Association Pages: 2222-2237 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1915319 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1915319 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2222-2237 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1891926_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Daniel R. Kowal Author-X-Name-First: Daniel R. Author-X-Name-Last: Kowal Title: Fast, Optimal, and Targeted Predictions Using Parameterized Decision Analysis Abstract: Prediction is critical for decision-making under uncertainty and lends validity to statistical inference. With targeted prediction, the goal is to optimize predictions for specific decision tasks of interest, which we represent via functionals. Although classical decision analysis extracts predictions from a Bayesian model, these predictions are often difficult to interpret and slow to compute. Instead, we design a class of parameterized actions for Bayesian decision analysis that produce optimal, scalable, and simple targeted predictions. For a wide variety of action parameterizations and loss functions—including linear actions with sparsity constraints for targeted variable selection—we derive a convenient representation of the optimal targeted prediction that yields efficient and interpretable solutions. Customized out-of-sample predictive metrics are developed to evaluate and compare among targeted predictors. Through careful use of the posterior predictive distribution, we introduce a procedure that identifies a set of near-optimal, or acceptable, targeted predictors, which provide unique insights into the features and level of complexity needed for accurate targeted prediction.
Simulations demonstrate excellent prediction, estimation, and variable selection capabilities. Targeted predictions are constructed for physical activity (PA) data from the National Health and Nutrition Examination Survey to better predict and understand the characteristics of intraday PA. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1875-1886 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1891926 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1891926 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1875-1886 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1896526_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Walter Dempsey Author-X-Name-First: Walter Author-X-Name-Last: Dempsey Author-Name: Brandon Oselio Author-X-Name-First: Brandon Author-X-Name-Last: Oselio Author-Name: Alfred Hero Author-X-Name-First: Alfred Author-X-Name-Last: Hero Title: Hierarchical Network Models for Exchangeable Structured Interaction Processes Abstract: Network data often arises via a series of structured interactions among a population of constituent elements. E-mail exchanges, for example, have a single sender followed by potentially multiple receivers. Scientific articles, on the other hand, may have multiple subject areas and multiple authors. We introduce a statistical model, termed the Pitman-Yor hierarchical vertex components model (PY-HVCM), that is well suited for structured interaction data. The proposed PY-HVCM effectively models complex relational data by partial pooling of local information via a latent, shared population-level distribution. The PY-HVCM is a canonical example of hierarchical vertex components models—a subfamily of models for exchangeable structured interaction-labeled networks, that is, networks invariant to interaction relabeling. Theoretical analysis and supporting simulations provide clear model interpretation, and establish global sparsity and power law degree distribution. A computationally tractable Gibbs sampling algorithm is derived for inferring sparsity and power law properties of complex networks. We demonstrate the model on both the Enron e-mail dataset and an ArXiv dataset, showing goodness of fit of the model via posterior predictive validation. Journal: Journal of the American Statistical Association Pages: 2056-2073 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1896526 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1896526 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2056-2073 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1901718_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Xuening Zhu Author-X-Name-First: Xuening Author-X-Name-Last: Zhu Author-Name: Zhanrui Cai Author-X-Name-First: Zhanrui Author-X-Name-Last: Cai Author-Name: Yanyuan Ma Author-X-Name-First: Yanyuan Author-X-Name-Last: Ma Title: Network Functional Varying Coefficient Model Abstract: We consider functional responses with network dependence observed for each individual at irregular time points.
To model both the interindividual dependence and within-individual dynamic correlation, we propose a network functional varying coefficient (NFVC) model. The response of each individual is characterized by a linear combination of responses from its connected nodes and its exogenous covariates. All the model coefficients are allowed to be time dependent. The NFVC model adds to the richness of both the classical network autoregression model and the functional regression models. To overcome the complexity caused by the network interdependence, we devise a special nonparametric least-squares-type estimator, which is feasible when the responses are observed at irregular time points for different individuals. The estimator takes advantage of the sparsity of the network structure to reduce the computational burden. To further conduct the functional principal component analysis, a novel within-individual covariance function estimation method is proposed and studied. Theoretical properties of our estimators are analyzed using techniques related to empirical processes, nonparametrics, functional data analysis, and various concentration inequalities. We analyze a social network dataset to illustrate the power of the proposed procedure. Journal: Journal of the American Statistical Association Pages: 2074-2085 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1901718 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1901718 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2074-2085 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1909599_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Jacob Fiksel Author-X-Name-First: Jacob Author-X-Name-Last: Fiksel Author-Name: Abhirup Datta Author-X-Name-First: Abhirup Author-X-Name-Last: Datta Author-Name: Agbessi Amouzou Author-X-Name-First: Agbessi Author-X-Name-Last: Amouzou Author-Name: Scott Zeger Author-X-Name-First: Scott Author-X-Name-Last: Zeger Title: Generalized Bayes Quantification Learning under Dataset Shift Abstract: Quantification learning is the task of prevalence estimation for a test population using predictions from a classifier trained on a different population. Quantification methods assume that the sensitivities and specificities of the classifier are either perfect or transportable from the training to the test population. These assumptions are inappropriate in the presence of dataset shift, when the misclassification rates in the training population are not representative of those for the test population. Quantification under dataset shift has been addressed only for single-class (categorical) predictions and assuming perfect knowledge of the true labels on a small subset of the test population. We propose generalized Bayes quantification learning (GBQL) that uses the entire compositional predictions from probabilistic classifiers and allows for uncertainty in true class labels for the limited labeled test data. Instead of positing a full model, we use a model-free Bayesian estimating equation approach to compositional data using Kullback–Leibler loss-functions based only on a first-moment assumption.
The idea will be useful in Bayesian compositional data analysis in general as it is robust to different generating mechanisms for compositional data and allows 0’s and 1’s in the compositional outputs, thereby including categorical outputs as a special case. We show how our method yields existing quantification approaches as special cases. An extension to an ensemble GBQL that uses predictions from multiple classifiers, yielding inference robust to the inclusion of a poor classifier, is also discussed. We outline a fast and efficient Gibbs sampler using a rounding and coarsening approximation to the loss functions. We establish posterior consistency, asymptotic normality and valid coverage of interval estimates from GBQL, which to our knowledge are the first theoretical results for a quantification approach in the presence of local labeled data. We also establish a finite-sample posterior concentration rate. Empirical performance of GBQL is demonstrated through simulations and analysis of real data with evident dataset shift. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2163-2181 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1909599 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1909599 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2163-2181 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2054816_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Blakeley B. McShane Author-X-Name-First: Blakeley B. Author-X-Name-Last: McShane Author-Name: Ulf Böckenholt Author-X-Name-First: Ulf Author-X-Name-Last: Böckenholt Author-Name: Karsten T. Hansen Author-X-Name-First: Karsten T. Author-X-Name-Last: Hansen Title: Variation and Covariation in Large-Scale Replication Projects: An Evaluation of Replicability Abstract: Over the last decade, large-scale replication projects across the biomedical and social sciences have reported relatively low replication rates. In these large-scale replication projects, replication has typically been evaluated based on a single replication study of some original study and dichotomously as successful or failed. However, evaluations of replicability that are based on a single study and are dichotomous are inadequate, and evaluations of replicability should instead be based on multiple studies, be continuous, and be multi-faceted. Further, such evaluations are in fact possible due to two characteristics shared by many large-scale replication projects. In this article, we provide such an evaluation for two prominent large-scale replication projects, one which replicated a phenomenon from cognitive psychology and another which replicated 13 phenomena from social psychology and behavioral economics. Our results indicate a very high degree of replicability in the former and a medium to low degree of replicability in the latter. They also suggest an unidentified covariate in each, namely ocular dominance in the former and political ideology in the latter, that is theoretically pertinent. We conclude by discussing evaluations of replicability at large, recommendations for future large-scale replication projects, and design-based model generalization. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 1605-1621 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2022.2054816 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2054816 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1605-1621 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1895175_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Yan Sun Author-X-Name-First: Yan Author-X-Name-Last: Sun Author-Name: Qifan Song Author-X-Name-First: Qifan Author-X-Name-Last: Song Author-Name: Faming Liang Author-X-Name-First: Faming Author-X-Name-Last: Liang Title: Consistent Sparse Deep Learning: Theory and Computation Abstract: Deep learning has been the engine powering many successes of data science. However, the deep neural network (DNN), as the basic model of deep learning, is often excessively over-parameterized, causing many difficulties in training, prediction and interpretation. We propose a frequentist-like method for learning sparse DNNs and justify its consistency under the Bayesian framework: the proposed method could learn a sparse DNN with at most O(n/log(n)) connections and nice theoretical guarantees such as posterior consistency, variable selection consistency and asymptotically optimal generalization bounds. In particular, we establish posterior consistency for the sparse DNN with a mixture Gaussian prior, show that the structure of the sparse DNN can be consistently determined using a Laplace approximation-based marginal posterior inclusion probability approach, and use Bayesian evidence to elicit sparse DNNs learned by an optimization method such as stochastic gradient descent in multiple runs with different initializations. The proposed method is computationally more efficient than standard Bayesian methods for large-scale sparse DNNs. The numerical results indicate that the proposed method can perform very well for large-scale network compression and high-dimensional nonlinear variable selection, both advancing interpretable machine learning. Journal: Journal of the American Statistical Association Pages: 1981-1995 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1895175 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1895175 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1981-1995 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1888740_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Sai Li Author-X-Name-First: Sai Author-X-Name-Last: Li Author-Name: T. Tony Cai Author-X-Name-First: T. Tony Author-X-Name-Last: Cai Author-Name: Hongzhe Li Author-X-Name-First: Hongzhe Author-X-Name-Last: Li Title: Inference for High-Dimensional Linear Mixed-Effects Models: A Quasi-Likelihood Approach Abstract: Linear mixed-effects models are widely used in analyzing clustered or repeated measures data. We propose a quasi-likelihood approach for estimation and inference of the unknown parameters in linear mixed-effects models with high-dimensional fixed effects. The proposed method is applicable to general settings where the dimension of the random effects and the cluster sizes are possibly large.
Regarding the fixed effects, we provide rate optimal estimators and valid inference procedures that do not rely on the structural information of the variance components. We also study the estimation of variance components with high-dimensional fixed effects in general settings. The algorithms are easy to implement and computationally fast. The proposed methods are assessed in various simulation settings and are applied to a real study regarding the associations between body mass index and genetic polymorphic markers in a heterogeneous stock mice population. Journal: Journal of the American Statistical Association Pages: 1835-1846 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1888740 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1888740 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1835-1846 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1889565_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Gonzalo García-Donato Author-X-Name-First: Gonzalo Author-X-Name-Last: García-Donato Author-Name: Rui Paulo Author-X-Name-First: Rui Author-X-Name-Last: Paulo Title: Variable Selection in the Presence of Factors: A Model Selection Perspective Abstract: In the context of a Gaussian multiple regression model, we address the problem of variable selection when in the list of potential predictors there are factors, that is, categorical variables. We adopt a model selection perspective, that is, we approach the problem by constructing a class of models, each corresponding to a particular selection of active variables. The methodology is Bayesian and proceeds by computing the posterior probability of each of these models. We highlight the fact that the set of competing models depends on the dummy variable representation of the factors, an issue already documented by Fernández et al. in a particular example but that has not received any attention since then. We construct methodology that circumvents this problem and that presents very competitive frequentist behavior when compared with recently proposed techniques. Additionally, it is fully automatic, in that it does not require the specification of any tuning parameters. Journal: Journal of the American Statistical Association Pages: 1847-1857 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1889565 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1889565 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1847-1857 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1912758_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Yaniv Tenzer Author-X-Name-First: Yaniv Author-X-Name-Last: Tenzer Author-Name: Micha Mandel Author-X-Name-First: Micha Author-X-Name-Last: Mandel Author-Name: Or Zuk Author-X-Name-First: Or Author-X-Name-Last: Zuk Title: Testing Independence Under Biased Sampling Abstract: Testing for dependence between pairs of random variables is a fundamental problem in statistics. In some applications, data are subject to selection bias that can create spurious dependence. An important example is truncation models, in which observed pairs are restricted to a specific subset of the X-Y plane. 
Standard tests for independence are not suitable in such cases, and alternative tests that take the selection bias into account are required. Here, we generalize the notion of quasi-independence with respect to the sampling mechanism, and study the problem of detecting any deviations from it. We develop two test statistics motivated by the classic Hoeffding’s statistic, and use two approaches to compute their distribution under the null: (i) a bootstrap-based approach, and (ii) a permutation test with nonuniform probability of permutations. We also handle an important application to the case of censoring with truncation, by estimating the biased sampling mechanism from the data. We prove the validity of the tests, and show, using simulations, that they improve power compared to competing methods for important special cases. The tests are applied to four datasets, two that are subject to truncation, with and without censoring, and two that are subject to bias mechanisms related to length bias. Journal: Journal of the American Statistical Association Pages: 2194-2206 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1912758 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1912758 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2194-2206 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1883437_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Yuehao Bai Author-X-Name-First: Yuehao Author-X-Name-Last: Bai Author-Name: Joseph P. Romano Author-X-Name-First: Joseph P. Author-X-Name-Last: Romano Author-Name: Azeem M. Shaikh Author-X-Name-First: Azeem M. Author-X-Name-Last: Shaikh Title: Inference in Experiments With Matched Pairs Abstract: This article studies inference for the average treatment effect in randomized controlled trials where treatment status is determined according to a “matched pairs” design. By a “matched pairs” design, we mean that units are sampled iid from the population of interest, paired according to observed baseline covariates, and finally, within each pair, one unit is selected at random for treatment. This type of design is used routinely throughout the sciences, but fundamental questions about its implications for inference about the average treatment effect remain. The main requirement underlying our analysis is that pairs are formed so that units within pairs are suitably “close” in terms of the baseline covariates, and we develop novel results to ensure that pairs are formed in a way that satisfies this condition. Under this assumption, we show that, for the problem of testing the null hypothesis that the average treatment effect equals a prespecified value in such settings, the commonly used two-sample t-test and “matched pairs” t-test are conservative in the sense that these tests have limiting rejection probability under the null hypothesis no greater than and typically strictly less than the nominal level. We show, however, that a simple adjustment to the standard errors of these tests leads to a test that is asymptotically exact in the sense that its limiting rejection probability under the null hypothesis equals the nominal level. We also study the behavior of randomization tests that arise naturally in these types of settings.
We show that, when implemented appropriately, this approach also leads to a test that is asymptotically exact in the sense described previously, but additionally has finite-sample rejection probability no greater than the nominal level for certain distributions satisfying the null hypothesis. A simulation study and empirical application confirm the practical relevance of our theoretical results. Journal: Journal of the American Statistical Association Pages: 1726-1737 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1883437 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1883437 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1726-1737 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2117703_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Blakeley B. McShane Author-X-Name-First: Blakeley B. Author-X-Name-Last: McShane Author-Name: Ulf Böckenholt Author-X-Name-First: Ulf Author-X-Name-Last: Böckenholt Author-Name: Karsten T. Hansen Author-X-Name-First: Karsten T. Author-X-Name-Last: Hansen Title: Modeling and Learning From Variation and Covariation Journal: Journal of the American Statistical Association Pages: 1627-1630 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2022.2117703 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2117703 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1627-1630 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2066536_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Wensheng Guo Author-X-Name-First: Wensheng Author-X-Name-Last: Guo Author-Name: Mengying You Author-X-Name-First: Mengying Author-X-Name-Last: You Author-Name: Jialin Yi Author-X-Name-First: Jialin Author-X-Name-Last: Yi Author-Name: Michel A. Pontari Author-X-Name-First: Michel A. Author-X-Name-Last: Pontari Author-Name: J. Richard Landis Author-X-Name-First: J. Richard Author-X-Name-Last: Landis Title: Functional Mixed Effects Clustering with Application to Longitudinal Urologic Chronic Pelvic Pain Syndrome Symptom Data Abstract: By clustering patients with the urologic chronic pelvic pain syndromes (UCPPS) into homogeneous subgroups and associating these subgroups with baseline covariates and other clinical outcomes, we provide opportunities to investigate different potential elements of pathogenesis, which may also guide us in the selection of appropriate therapeutic targets. Motivated by the longitudinal urologic symptom data with extensive subject heterogeneity and differential variability of trajectories, we propose a functional clustering procedure where each subgroup is modeled by a functional mixed effects model, and the posterior probability is used to iteratively classify each subject into different subgroups. The classification takes into account both group-average trajectories and between-subject variabilities. We develop an equivalent state-space model for efficient computation. We also propose a cross-validation-based Kullback–Leibler information criterion to choose the optimal number of subgroups. The performance of the proposed method is assessed through a simulation study.
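To make the classification step of such model-based functional clustering concrete, here is a minimal Python sketch of posterior-probability assignment under subgroup-specific Gaussian models on a common time grid. It is a simplified stand-in (the article handles irregular designs through an equivalent state-space model), and all numbers below are toy values.

    import numpy as np
    from scipy.stats import multivariate_normal

    def classify(y, means, covs, weights):
        # y: (n, T) subject trajectories; one Gaussian model per subgroup
        logpost = np.stack([np.log(w) + multivariate_normal(m, S).logpdf(y)
                            for w, m, S in zip(weights, means, covs)], axis=1)
        return logpost.argmax(axis=1)   # MAP subgroup label per subject

    rng = np.random.default_rng(0)
    T = 4
    y = np.vstack([rng.normal(0.0, 1.0, (5, T)),    # subgroup around 0
                   rng.normal(3.0, 1.0, (5, T))])   # subgroup around 3
    labels = classify(y, [np.zeros(T), np.full(T, 3.0)],
                      [np.eye(T), np.eye(T)], weights=[0.5, 0.5])
    print(labels)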
We apply our methods to longitudinal bi-weekly measures of a primary urological urinary symptoms score from a UCPPS longitudinal cohort study, and identify four subgroups, ranging from moderate decline and mild decline to stable and mildly increasing trajectories. The resulting clusters are associated with one-year changes in several clinically important outcomes and are related to several clinically relevant baseline predictors, such as sleep disturbance score, physical quality of life and painful urgency. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1631-1641 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2022.2066536 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2066536 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1631-1641 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1909600_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Jacob Vorstrup Goldman Author-X-Name-First: Jacob Vorstrup Author-X-Name-Last: Goldman Author-Name: Torben Sell Author-X-Name-First: Torben Author-X-Name-Last: Sell Author-Name: Sumeetpal Sidhu Singh Author-X-Name-First: Sumeetpal Sidhu Author-X-Name-Last: Singh Title: Gradient-Based Markov Chain Monte Carlo for Bayesian Inference With Non-differentiable Priors Abstract: The use of nondifferentiable priors in Bayesian statistics has become increasingly popular, in particular in Bayesian imaging analysis. Current state-of-the-art methods are approximate in the sense that they replace the posterior with a smooth approximation via Moreau-Yosida envelopes, and apply gradient-based discretized diffusions to sample from the resulting distribution. We characterize the error of the Moreau-Yosida approximation and propose a novel implementation using underdamped Langevin dynamics. In mission-critical cases, however, replacing the posterior with an approximation may not be a viable option. Instead, we show that piecewise-deterministic Markov processes (PDMP) can be used for exact posterior inference from distributions satisfying almost everywhere differentiability. Furthermore, in contrast with diffusion-based methods, the suggested PDMP-based samplers place no assumptions on the prior shape, nor require access to a computationally cheap proximal operator, and consequently have a much broader scope of application. Through detailed numerical examples, including a nondifferentiable circular distribution and a nonconvex genomics model, we elucidate the relative strengths of these sampling methods on problems of moderate to high dimensions, underlining the benefits of PDMP-based methods when accurate sampling is decisive. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2182-2193 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1909600 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1909600 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2182-2193 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1895178_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Ziyang Lyu Author-X-Name-First: Ziyang Author-X-Name-Last: Lyu Author-Name: A.H.
Welsh Author-X-Name-First: A.H. Author-X-Name-Last: Welsh Title: Asymptotics for EBLUPs: Nested Error Regression Models Abstract: In this article we derive the asymptotic distribution of estimated best linear unbiased predictors (EBLUPs) of the random effects in a nested error regression model. Under very mild conditions which do not require the assumption of normality, we show that asymptotically the distribution of the EBLUPs as both the number of clusters and the cluster sizes diverge to infinity is the convolution of the true distribution of the random effects and a normal distribution. This result yields very simple asymptotic approximations to and estimators of the prediction mean squared error of EBLUPs, and then asymptotic prediction intervals for the unobserved random effects. We also derive a higher order approximation to the asymptotic mean squared error and provide a detailed theoretical and empirical comparison with the well-known analytical prediction mean squared error approximations and estimators proposed by Kackar and Harville, and by Prasad and Rao. We show that our simple estimator of the prediction mean squared errors of EBLUPs works very well in practice when both the number of clusters and the cluster sizes are sufficiently large. Finally, we illustrate the use of the asymptotic prediction intervals with data on radon measurements of houses in Massachusetts and Arizona. Journal: Journal of the American Statistical Association Pages: 2028-2042 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1895178 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1895178 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2028-2042 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1904958_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Tianxi Cai Author-X-Name-First: Tianxi Author-X-Name-Last: Cai Author-Name: Molei Liu Author-X-Name-First: Molei Author-X-Name-Last: Liu Author-Name: Yin Xia Author-X-Name-First: Yin Author-X-Name-Last: Xia Title: Individual Data Protected Integrative Regression Analysis of High-Dimensional Heterogeneous Data Abstract: Evidence-based decision making often relies on meta-analyzing multiple studies, which enables more precise estimation and investigation of generalizability. Integrative analysis of multiple heterogeneous studies is, however, highly challenging in the ultra high-dimensional setting. The challenge is even more pronounced when the individual-level data cannot be shared across studies, known as the DataSHIELD constraint. Under sparse regression models that are assumed to be similar yet not identical across studies, we propose in this paper a novel integrative estimation procedure for data-Shielding High-dimensional Integrative Regression (SHIR). SHIR protects individual data through a summary-statistics-based integration procedure, accommodates between-study heterogeneity in both the covariate distribution and model parameters, and attains consistent variable selection. Theoretically, SHIR is statistically more efficient than the existing distributed approaches that integrate debiased LASSO estimators from the local sites. Furthermore, the estimation error incurred by aggregating derived data is negligible compared to the statistical minimax rate, and SHIR is shown to be asymptotically equivalent in estimation to the ideal estimator obtained by sharing all data.
The finite-sample performance of our method is studied and compared with existing approaches in extensive simulation settings. We further illustrate the utility of SHIR by deriving phenotyping algorithms for coronary artery disease using electronic health records data from multiple chronic disease cohorts. Journal: Journal of the American Statistical Association Pages: 2105-2119 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1904958 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1904958 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2105-2119 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1886937_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Xi Chen Author-X-Name-First: Xi Author-X-Name-Last: Chen Author-Name: Jason D. Lee Author-X-Name-First: Jason D. Author-X-Name-Last: Lee Author-Name: He Li Author-X-Name-First: He Author-X-Name-Last: Li Author-Name: Yun Yang Author-X-Name-First: Yun Author-X-Name-Last: Yang Title: Distributed Estimation for Principal Component Analysis: An Enlarged Eigenspace Analysis Abstract: The growing size of modern datasets brings many challenges to the existing statistical estimation approaches, which calls for new distributed methodologies. This article studies distributed estimation for a fundamental statistical machine learning problem, principal component analysis (PCA). Despite the massive literature on top eigenvector estimation, much less has been done on top-L-dim (L > 1) eigenspace estimation, especially in a distributed manner. We propose a novel multi-round algorithm for constructing the top-L-dim eigenspace for distributed data. Our algorithm takes advantage of shift-and-invert preconditioning and convex optimization. Our estimator is communication-efficient and achieves a fast convergence rate. In contrast to the existing divide-and-conquer algorithm, our approach has no restriction on the number of machines. Theoretically, the traditional Davis–Kahan theorem requires the explicit eigengap assumption to estimate the top-L-dim eigenspace. To abandon this eigengap assumption, we consider a new route in our analysis: instead of exactly identifying the top-L-dim eigenspace, we show that our estimator is able to cover the targeted top-L-dim population eigenspace. Our distributed algorithm can be applied to a wide range of statistical problems based on PCA, such as principal component regression and the single index model. Finally, we provide simulation studies to demonstrate the performance of the proposed distributed estimator. Journal: Journal of the American Statistical Association Pages: 1775-1786 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1886937 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1886937 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1775-1786 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2139708_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: S. Lynne Stokes Author-X-Name-First: S. Lynne Author-X-Name-Last: Stokes Title: Sampling: Design and Analysis, 3rd ed.
Journal: Journal of the American Statistical Association Pages: 2287-2288 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2022.2139708 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2139708 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2287-2288 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1888739_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Nabarun Deb Author-X-Name-First: Nabarun Author-X-Name-Last: Deb Author-Name: Sujayam Saha Author-X-Name-First: Sujayam Author-X-Name-Last: Saha Author-Name: Adityanand Guntuboyina Author-X-Name-First: Adityanand Author-X-Name-Last: Guntuboyina Author-Name: Bodhisattva Sen Author-X-Name-First: Bodhisattva Author-X-Name-Last: Sen Title: Two-Component Mixture Model in the Presence of Covariates Abstract: In this article, we study a generalization of the two-groups model in the presence of covariates—a problem that has recently received much attention in the statistical literature due to its applicability in multiple hypotheses testing problems. The model we consider allows for infinite dimensional parameters and offers flexibility in modeling the dependence of the response on the covariates. We discuss the identifiability issues arising in this model and systematically study several estimation strategies. We propose a tuning parameter-free nonparametric maximum likelihood method, implementable via the expectation-maximization algorithm, to estimate the unknown parameters. Further, we derive the rate of convergence of the proposed estimators—in particular we show that the finite sample Hellinger risk for every ‘approximate’ nonparametric maximum likelihood estimator achieves a near-parametric rate (up to logarithmic multiplicative factors). In addition, we propose and theoretically study two ‘marginal’ methods that are more scalable and easily implementable. We demonstrate the efficacy of our procedures through extensive simulation studies and relevant data analyses—one arising from neuroscience and the other from astronomy. We also outline the application of our methods to multiple testing. The companion R package NPMLEmix implements all the procedures proposed in this article. Journal: Journal of the American Statistical Association Pages: 1820-1834 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1888739 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1888739 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1820-1834 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1895177_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Chengchun Shi Author-X-Name-First: Chengchun Author-X-Name-Last: Shi Author-Name: Lexin Li Author-X-Name-First: Lexin Author-X-Name-Last: Li Title: Testing Mediation Effects Using Logic of Boolean Matrices Abstract: A central question in high-dimensional mediation analysis is to infer the significance of individual mediators. The main challenge is that the total number of potential paths that go through any mediator is super-exponential in the number of mediators. 
Most existing mediation inference solutions either explicitly impose that the mediators are conditionally independent given the exposure, or ignore any potential directed paths among the mediators. In this article, we propose a novel hypothesis testing procedure to evaluate individual mediation effects, while taking into account potential interactions among the mediators. Our proposal thus fills a crucial gap, and greatly extends the scope of existing mediation tests. Our key idea is to construct the test statistic using the logic of Boolean matrices, which enables us to establish the proper limiting distribution under the null hypothesis. We further employ screening, data splitting, and decorrelated estimation to reduce the bias and increase the power of the test. We show that our test can control both the size and false discovery rate asymptotically, and the power of the test approaches one, while allowing the number of mediators to diverge to infinity with the sample size. We demonstrate the efficacy of the method through simulations and a neuroimaging study of Alzheimer’s disease. A Python implementation of the proposed procedure is available at https://github.com/callmespring/LOGAN. Journal: Journal of the American Statistical Association Pages: 2014-2027 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1895177 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1895177 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2014-2027 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1915320_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: David Azriel Author-X-Name-First: David Author-X-Name-Last: Azriel Author-Name: Lawrence D. Brown Author-X-Name-First: Lawrence D. Author-X-Name-Last: Brown Author-Name: Michael Sklar Author-X-Name-First: Michael Author-X-Name-Last: Sklar Author-Name: Richard Berk Author-X-Name-First: Richard Author-X-Name-Last: Berk Author-Name: Andreas Buja Author-X-Name-First: Andreas Author-X-Name-Last: Buja Author-Name: Linda Zhao Author-X-Name-First: Linda Author-X-Name-Last: Zhao Title: Semi-Supervised Linear Regression Abstract: We study a regression problem where for some part of the data we observe both the label variable (Y) and the predictors (X), while for the other part of the data only the predictors are given. Such a problem arises, for example, when observations of the label variable are costly and may require a skilled human agent. When the conditional expectation E[Y|X] is not exactly linear, one can consider the best linear approximation to the conditional expectation, which can be estimated consistently by the least-squares estimate (LSE). The latter depends only on the labeled data. We suggest improved alternative estimates to the LSE that also use the unlabeled data. Our estimation method can be easily implemented and has easily described asymptotic properties. The new estimates asymptotically dominate the usual standard procedures under a certain nonlinearity condition on E[Y|X]; otherwise, they are asymptotically equivalent. The performance of the new estimator for small sample sizes is investigated in an extensive simulation study. A real-data example of inferring a homeless population is used to illustrate the new methodology.
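As a hedged illustration of how unlabeled covariates can sharpen least-squares estimation: one generic device (not necessarily the estimator of Azriel et al.) is to estimate the covariate second-moment matrix from all observations, labeled and unlabeled, while the cross-moment with Y uses labeled data only. A minimal sketch with simulated data:

    import numpy as np

    rng = np.random.default_rng(0)
    n_lab, n_unlab, p = 100, 1000, 3
    X_lab = rng.normal(size=(n_lab, p))
    X_unlab = rng.normal(size=(n_unlab, p))       # covariates without labels
    y = (X_lab @ np.array([1.0, -2.0, 0.5])
         + 0.3 * X_lab[:, 0] ** 2                 # nonlinear E[Y|X]
         + rng.normal(size=n_lab))

    beta_ols = np.linalg.lstsq(X_lab, y, rcond=None)[0]   # labeled-only LSE

    X_all = np.vstack([X_lab, X_unlab])
    gram_all = X_all.T @ X_all / X_all.shape[0]   # pooled second moment
    cross = X_lab.T @ y / n_lab                   # labeled cross-moment
    beta_semi = np.linalg.solve(gram_all, cross)  # semi-supervised variant
    print(beta_ols, beta_semi)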
Journal: Journal of the American Statistical Association Pages: 2238-2251 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1915320 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1915320 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2238-2251 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1893179_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: James Matuk Author-X-Name-First: James Author-X-Name-Last: Matuk Author-Name: Karthik Bharath Author-X-Name-First: Karthik Author-X-Name-Last: Bharath Author-Name: Oksana Chkrebtii Author-X-Name-First: Oksana Author-X-Name-Last: Chkrebtii Author-Name: Sebastian Kurtek Author-X-Name-First: Sebastian Author-X-Name-Last: Kurtek Title: Bayesian Framework for Simultaneous Registration and Estimation of Noisy, Sparse, and Fragmented Functional Data Abstract: In many applications, smooth processes generate data that are recorded under a variety of observational regimes, including dense sampling and sparse or fragmented observations that are often contaminated with error. The statistical goal of registering and estimating the individual underlying functions from discrete observations has thus far been mainly approached sequentially without formal uncertainty propagation, or in an application-specific manner by pooling information across subjects. We propose a unified Bayesian framework for simultaneous registration and estimation, which is flexible enough to accommodate inference on individual functions under general observational regimes. Our ability to do this relies on the specification of strongly informative prior models over the amplitude component of function variability using two strategies: a data-driven approach that defines an empirical basis for the amplitude subspace based on training data, and a shape-restricted approach when the relative location and number of extrema are well understood. The proposed methods build on the elastic functional data analysis framework to separately model amplitude and phase variability inherent in functional data. We emphasize the importance of uncertainty quantification and visualization of these two components as they provide complementary information about the estimated functions. We validate the proposed framework using multiple simulation studies and real applications. Journal: Journal of the American Statistical Association Pages: 1964-1980 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1893179 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1893179 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1964-1980 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1886936_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Thomas W. Yee Author-X-Name-First: Thomas W. Author-X-Name-Last: Yee Title: On the Hauck–Donner Effect in Wald Tests: Detection, Tipping Points, and Parameter Space Characterization Abstract: The Wald test remains ubiquitous in statistical practice despite shortcomings such as its inaccuracy in small samples and lack of invariance under reparameterization.
This article focuses on another, lesser-known shortcoming called the Hauck–Donner effect (HDE), whereby a Wald test statistic is no longer monotonically increasing as a function of the distance between the parameter estimate and the null value. Because it results in an upwardly biased p-value and a loss of power, the aberration can have very damaging consequences, for example in variable selection. The HDE afflicts many types of regression models and corresponds to estimates near the boundary of the parameter space. This article presents several new results, and its main contributions are to (i) propose a very general test for detecting the HDE in the class of vector generalized linear models (VGLMs), regardless of the underlying cause; (ii) fundamentally characterize the HDE by pairwise ratios of Wald, Rao score, and likelihood ratio test statistics for 1-parameter distributions with large samples; (iii) show that the parameter space may be partitioned into an interior encased by at least 5 HDE severity measures (faint, weak, moderate, strong, extreme); (iv) prove that a necessary condition for the HDE in a 2 by 2 table is a log odds ratio of at least 2; (v) give some practical guidelines about HDE-free hypothesis testing. Overall, practical post-fit tests can now be conducted, potentially for any model estimated by iteratively reweighted least squares, especially the GLM and VGLM classes, the latter of which encompasses many popular regression models. Journal: Journal of the American Statistical Association Pages: 1763-1774 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1886936 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1886936 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1763-1774 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1893177_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Xiao Guo Author-X-Name-First: Xiao Author-X-Name-Last: Guo Author-Name: Guang Cheng Author-X-Name-First: Guang Author-X-Name-Last: Cheng Title: Moderate-Dimensional Inferences on Quadratic Functionals in Ordinary Least Squares Abstract: Statistical inference for quadratic functionals of the linear regression parameter has found wide application, including signal detection, global testing, and inference on the error variance and the fraction of variance explained. Classical theory based on the ordinary least squares estimator works perfectly in the low-dimensional regime, but fails when the parameter dimension p_n grows proportionally to the sample size n. In some cases, its performance is not satisfactory even when n ≥ 5 p_n. The main contribution of this article is to develop dimension-adaptive inferences for quadratic functionals when lim_{n→∞} p_n/n = τ ∈ [0, 1). We propose a bias-and-variance-corrected test statistic and demonstrate that its theoretical validity (such as consistency and asymptotic normality) is adaptive to both low dimension with τ = 0 and moderate dimension with τ ∈ (0, 1). Our general theory holds, in particular, without Gaussian design/error or structural parameter assumptions, and applies to a broad class of quadratic functionals covering all aforementioned applications. As a by-product, we find that the classical fixed-dimensional results continue to hold if and only if the signal-to-noise ratio is large enough, say when p_n diverges but more slowly than n.
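For intuition on why a correction is needed in this regime, the following toy Python sketch shows the classical bias adjustment for the plug-in estimate of the quadratic functional ||beta||^2 under OLS. This is a standard textbook device, not the article's corrected statistic, and the Gaussian design below is an assumption made purely for the demonstration.

    import numpy as np

    rng = np.random.default_rng(5)
    n, p = 500, 100
    X = rng.normal(size=(n, p))                 # assumed Gaussian design
    beta = np.ones(p) / np.sqrt(p)              # true ||beta||^2 = 1
    y = X @ beta + rng.normal(size=n)

    XtX_inv = np.linalg.inv(X.T @ X)
    bhat = XtX_inv @ (X.T @ y)
    resid = y - X @ bhat
    s2 = resid @ resid / (n - p)                # error variance estimate
    naive = bhat @ bhat                         # biased upward by tr(Var(bhat))
    corrected = naive - s2 * np.trace(XtX_inv)  # classical bias correction
    print(round(naive, 3), round(corrected, 3))  # corrected is close to 1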
Extensive numerical results demonstrate the satisfactory performance of the proposed methodology even when p_n ≥ 0.9 n in some extreme cases. The mathematical arguments are based on random matrix theory and the leave-one-observation-out method. Journal: Journal of the American Statistical Association Pages: 1931-1950 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1893177 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1893177 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1931-1950 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2089572_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Fei Xue Author-X-Name-First: Fei Author-X-Name-Last: Xue Author-Name: Xiwei Tang Author-X-Name-First: Xiwei Author-X-Name-Last: Tang Author-Name: Grace Kim Author-X-Name-First: Grace Author-X-Name-Last: Kim Author-Name: Karestan C. Koenen Author-X-Name-First: Karestan C. Author-X-Name-Last: Koenen Author-Name: Chantel L. Martin Author-X-Name-First: Chantel L. Author-X-Name-Last: Martin Author-Name: Sandro Galea Author-X-Name-First: Sandro Author-X-Name-Last: Galea Author-Name: Derek Wildman Author-X-Name-First: Derek Author-X-Name-Last: Wildman Author-Name: Monica Uddin Author-X-Name-First: Monica Author-X-Name-Last: Uddin Author-Name: Annie Qu Author-X-Name-First: Annie Author-X-Name-Last: Qu Title: Heterogeneous Mediation Analysis on Epigenomic PTSD and Traumatic Stress in a Predominantly African American Cohort Abstract: DNA methylation (DNAm) has been suggested to play a critical role in post-traumatic stress disorder (PTSD), through mediating the relationship between trauma and PTSD. However, this underlying mechanism of PTSD for African Americans remains unknown. To fill this gap, in this article, we investigate how DNAm mediates the effects of traumatic experiences on PTSD symptoms in the Detroit Neighborhood Health Study (DNHS) (2008–2013), which involves primarily African American adults. To achieve this, we develop a new mediation analysis approach for high-dimensional potential DNAm mediators. A key novelty of our method is that we consider heterogeneity in mediation effects across subpopulations. Specifically, mediators in different subpopulations could have opposite effects on the outcome, and thus could be difficult to identify under a traditional homogeneous model framework. In contrast, the proposed method estimates heterogeneous mediation effects and identifies subpopulations in which individuals share similar effects. Simulation studies demonstrate that the proposed method outperforms existing methods for both homogeneous and heterogeneous data. We also present our mediation analysis results of a dataset with 125 participants and more than 450,000 CpG sites from the DNHS study. The proposed method finds three subgroups of subjects and identifies DNAm mediators corresponding to genes such as HSP90AA1 and NFATC1, which have been linked to PTSD symptoms in the literature. Our findings could be useful in future finer-grained investigations of the PTSD mechanism and in the development of new treatments for PTSD. Supplementary materials for this article are available online.
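For readers unfamiliar with mediation effects, here is a toy sketch of the classical single-mediator, homogeneous-population product-of-coefficients quantity that heterogeneous mediation analysis generalizes. The simulated variables are hypothetical stand-ins for trauma exposure, a DNAm mediator, and a symptom score, not DNHS data.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 500
    t = rng.normal(size=n)                        # exposure (e.g., trauma)
    m = 0.6 * t + rng.normal(size=n)              # mediator (e.g., DNAm)
    y = 0.8 * m + 0.3 * t + rng.normal(size=n)    # outcome (e.g., symptoms)

    a = np.polyfit(t, m, 1)[0]                    # exposure -> mediator slope
    X = np.column_stack([np.ones(n), m, t])
    b = np.linalg.lstsq(X, y, rcond=None)[0][1]   # mediator -> outcome, given t
    print("indirect effect a*b =", a * b)         # close to 0.6 * 0.8 = 0.48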
Journal: Journal of the American Statistical Association Pages: 1669-1683 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2022.2089572 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2089572 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1669-1683 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1917417_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Yumou Qiu Author-X-Name-First: Yumou Author-X-Name-Last: Qiu Author-Name: Xiao-Hua Zhou Author-X-Name-First: Xiao-Hua Author-X-Name-Last: Zhou Title: Inference on Multi-level Partial Correlations Based on Multi-subject Time Series Data Abstract: Partial correlations are commonly used to analyze the conditional dependence among variables. In this work, we propose a hierarchical model to study both the subject- and population-level partial correlations based on multi-subject time-series data. Multiple testing procedures adaptive to temporally dependent data with false discovery proportion control are proposed to identify the nonzero partial correlations at both the subject and population levels. A computationally feasible algorithm is developed. Theoretical results and simulation studies demonstrate the good properties of the proposed procedures. We illustrate the application of the proposed methods in a real example of brain connectivity using fMRI data from normal healthy persons and patients with Parkinson’s disease. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2268-2282 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1917417 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1917417 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2268-2282 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2096618_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Paul De Boeck Author-X-Name-First: Paul Author-X-Name-Last: De Boeck Author-Name: Michael L. DeKay Author-X-Name-First: Michael L. Author-X-Name-Last: DeKay Author-Name: Menglin Xu Author-X-Name-First: Menglin Author-X-Name-Last: Xu Title: The Potential of Factor Analysis for Replication, Generalization, and Integration Journal: Journal of the American Statistical Association Pages: 1622-1626 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2022.2096618 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2096618 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1622-1626 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2087658_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Dengdeng Yu Author-X-Name-First: Dengdeng Author-X-Name-Last: Yu Author-Name: Linbo Wang Author-X-Name-First: Linbo Author-X-Name-Last: Wang Author-Name: Dehan Kong Author-X-Name-First: Dehan Author-X-Name-Last: Kong Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Title: Mapping the Genetic-Imaging-Clinical Pathway with Applications to Alzheimer’s Disease Abstract: Alzheimer’s disease is a progressive form of dementia that results in problems with memory, thinking, and behavior. It often starts with abnormal aggregation and deposition of β amyloid and tau, followed by neuronal damage such as atrophy of the hippocampi, leading to Alzheimer’s disease (AD). The aim of this article is to map the genetic-imaging-clinical pathway for AD in order to delineate the genetically-regulated brain changes that drive disease progression based on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset. We develop a novel two-step approach to delineate the association between high-dimensional 2D hippocampal surface exposures and the Alzheimer’s Disease Assessment Scale (ADAS) cognitive score, while taking into account the ultra-high dimensional clinical and genetic covariates at baseline. Analysis results suggest that the radial distance of each pixel of both hippocampi is negatively associated with the severity of behavioral deficits conditional on observed clinical and genetic covariates. These associations are stronger in Cornu Ammonis region 1 (CA1) and subiculum subregions compared to Cornu Ammonis region 2 (CA2) and Cornu Ammonis region 3 (CA3) subregions. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1656-1668 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2022.2087658 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2087658 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1656-1668 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1887741_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Andrew Zammit-Mangion Author-X-Name-First: Andrew Author-X-Name-Last: Zammit-Mangion Author-Name: Tin Lok James Ng Author-X-Name-First: Tin Lok James Author-X-Name-Last: Ng Author-Name: Quan Vu Author-X-Name-First: Quan Author-X-Name-Last: Vu Author-Name: Maurizio Filippone Author-X-Name-First: Maurizio Author-X-Name-Last: Filippone Title: Deep Compositional Spatial Models Abstract: Spatial processes with nonstationary and anisotropic covariance structure are often used when modeling, analyzing, and predicting complex environmental phenomena. Such processes may often be expressed as ones that have stationary and isotropic covariance structure on a warped spatial domain. However, the warping function is generally difficult to fit and not constrained to be injective, often resulting in “space-folding.” Here, we propose modeling an injective warping function through a composition of multiple elemental injective functions in a deep-learning framework. 
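To illustrate the compositional idea just described, here is a minimal one-dimensional sketch: composing simple strictly increasing (hence injective) maps yields a flexible warping that cannot fold space. The layer weights below are arbitrary illustrative values, not fitted parameters from the article.

    import numpy as np

    def axial_warp(s, a, b):
        # strictly increasing in s for a, b > 0 (derivative 1 + a*b*sech^2(b*s) > 0),
        # hence injective
        return s + a * np.tanh(b * s)

    s = np.linspace(-1.0, 1.0, 5)
    warped = s.copy()
    for a, b in [(0.5, 2.0), (0.25, 4.0), (0.1, 8.0)]:  # three illustrative layers
        warped = axial_warp(warped, a, b)
    print(np.all(np.diff(warped) > 0))  # ordering preserved: no space-folding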
We consider two cases: first, when these functions are known up to some weights that need to be estimated; and second, when the weights in each layer are random. Inspired by recent methodological and technological advances in deep learning and deep Gaussian processes, we employ approximate Bayesian methods to make inference with these models using graphics processing units. Through simulation studies in one and two dimensions we show that the deep compositional spatial models are quick to fit, and are able to provide better predictions and uncertainty quantification than other deep stochastic models of similar complexity. We also show their remarkable capacity to model nonstationary, anisotropic spatial data using radiances from the MODIS instrument aboard the Aqua satellite. Journal: Journal of the American Statistical Association Pages: 1787-1808 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1887741 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1887741 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1787-1808 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1887742_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Jingnan Zhang Author-X-Name-First: Jingnan Author-X-Name-Last: Zhang Author-Name: Xin He Author-X-Name-First: Xin Author-X-Name-Last: He Author-Name: Junhui Wang Author-X-Name-First: Junhui Author-X-Name-Last: Wang Title: Directed Community Detection With Network Embedding Abstract: Community detection in network data aims at grouping similar nodes sharing certain characteristics together. Most existing methods focus on detecting communities in undirected networks, where similarity between nodes is measured by their node features and whether they are connected. In this article, we propose a novel method to conduct network embedding and community detection simultaneously in a directed network. The network embedding model introduces two sets of vectors to represent the out- and in-nodes separately, and thus allows the same node to belong to different out- and in-communities. The community detection formulation equips the negative log-likelihood with a novel regularization term to encourage community structure among the node representations, and thus achieves better performance by jointly estimating the node embeddings and their community structures. To tackle the resultant optimization task, an efficient alternating updating scheme is developed. More importantly, the asymptotic properties of the proposed method are established in terms of both network embedding and community detection, which are also supported by numerical experiments on simulated and real examples. Journal: Journal of the American Statistical Association Pages: 1809-1819 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1887742 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1887742 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1809-1819 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1914635_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Quefeng Li Author-X-Name-First: Quefeng Author-X-Name-Last: Li Author-Name: Lexin Li Author-X-Name-First: Lexin Author-X-Name-Last: Li Title: Integrative Factor Regression and Its Inference for Multimodal Data Analysis Abstract: Multimodal data, where different types of data are collected from the same subjects, are fast emerging in a large variety of scientific applications. Factor analysis is commonly used in integrative analysis of multimodal data, and is particularly useful to overcome the curse of high dimensionality and high correlations. However, there is little work on statistical inference for factor analysis-based supervised modeling of multimodal data. In this article, we consider an integrative linear regression model that is built upon the latent factors extracted from multimodal data. We address three important questions: how to infer the significance of one data modality given the other modalities in the model; how to infer the significance of a combination of variables from one modality or across different modalities; and how to quantify the contribution, measured by the goodness of fit, of one data modality given the others. When answering each question, we explicitly characterize both the benefit and the extra cost of factor analysis. Those questions, to our knowledge, have not yet been addressed despite wide use of factor analysis in integrative multimodal analysis, and our proposal bridges an important gap. We study the empirical performance of our methods through simulations, and further illustrate with a multimodal neuroimaging analysis. Journal: Journal of the American Statistical Association Pages: 2207-2221 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1914635 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1914635 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2207-2221 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1917416_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Paromita Dubey Author-X-Name-First: Paromita Author-X-Name-Last: Dubey Author-Name: Hans-Georg Müller Author-X-Name-First: Hans-Georg Author-X-Name-Last: Müller Title: Modeling Time-Varying Random Objects and Dynamic Networks Abstract: Samples of dynamic or time-varying networks and other random object data such as time-varying probability distributions are increasingly encountered in modern data analysis. Common methods for time-varying data such as functional data analysis are infeasible when observations are time courses of networks or other complex non-Euclidean random objects that are elements of general metric spaces. In such spaces, only pairwise distances between the data objects are available and a strong limitation is that one cannot carry out arithmetic operations due to the lack of an algebraic structure. We combat this complexity by a generalized notion of mean trajectory taking values in the object space. For this, we adopt pointwise Fréchet means and then construct pointwise distance trajectories between the individual time courses and the estimated Fréchet mean trajectory, thus representing the time-varying objects and networks by functional data. 
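A minimal sketch of the construction just described, using only pairwise distances: approximate the pointwise Fréchet mean by the sample medoid at each time point and record each object's distance trajectory to it. The toy distances below are generated from random Euclidean points purely for demonstration; in applications they would come from a general metric on networks or distributions.

    import numpy as np

    rng = np.random.default_rng(1)
    n, T = 20, 5
    # D[t] is the n x n matrix of pairwise distances between objects at time t;
    # here it is faked from random points in R^3 for demonstration only.
    pts = rng.normal(size=(T, n, 3))
    D = np.linalg.norm(pts[:, :, None, :] - pts[:, None, :, :], axis=-1)

    # Fréchet medoid per time point: minimizes the sum of squared distances
    medoid_idx = np.argmin((D ** 2).sum(axis=2), axis=1)
    # distance trajectory of each object to the estimated mean trajectory
    dist_traj = D[np.arange(T), :, medoid_idx]    # shape (T, n): functional data
    print(dist_traj.shape)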
Functional principal component analysis of these distance trajectories can reveal interesting features of dynamic networks and object time courses and is useful for downstream analysis. Our approach also makes it possible to study the empirical dynamics of time-varying objects, including dynamic regression to the mean or explosive behavior over time. We demonstrate desirable asymptotic properties of sample-based estimators for suitable population targets under mild assumptions. The utility of the proposed methodology is illustrated with dynamic networks, time-varying distribution data and longitudinal growth data. Journal: Journal of the American Statistical Association Pages: 2252-2267 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1917416 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1917416 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2252-2267 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1884561_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Changcheng Li Author-X-Name-First: Changcheng Author-X-Name-Last: Li Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Title: Linear Hypothesis Testing in Linear Models With High-Dimensional Responses Abstract: In this article, we propose a new projection test for linear hypotheses on regression coefficient matrices in linear models with high-dimensional responses. We systematically study the theoretical properties of the proposed test. We first derive the optimal projection matrix for any given projection dimension to achieve the best power and provide an upper bound for the optimal dimension of the projection matrix. We further provide insights into how to construct the optimal projection matrix. One- and two-sample mean problems can be formulated as special cases of linear hypotheses studied in this article. We both theoretically and empirically demonstrate that the proposed test can outperform the existing ones for one- and two-sample mean problems. We conduct Monte Carlo simulations to examine the finite-sample performance and illustrate the proposed test with a real-data example. Journal: Journal of the American Statistical Association Pages: 1738-1750 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1884561 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1884561 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1738-1750 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1893176_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Jiwei Zhao Author-X-Name-First: Jiwei Author-X-Name-Last: Zhao Author-Name: Yanyuan Ma Author-X-Name-First: Yanyuan Author-X-Name-Last: Ma Title: A Versatile Estimation Procedure Without Estimating the Nonignorable Missingness Mechanism Abstract: We consider the estimation problem in a regression setting where the outcome variable is subject to nonignorable missingness and identifiability is ensured by the shadow variable approach. We propose a versatile estimation procedure where modeling of the missingness mechanism is completely bypassed. We show that our estimator is easy to implement, and we derive the asymptotic theory of the proposed estimator. We also investigate some alternative estimators under different scenarios.
Comprehensive simulation studies are conducted to demonstrate the finite sample performance of the method. We apply the estimator to a children’s mental health study to illustrate its usefulness. Journal: Journal of the American Statistical Association Pages: 1916-1930 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1893176 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1893176 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1916-1930 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1875838_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Sihai Dave Zhao Author-X-Name-First: Sihai Dave Author-X-Name-Last: Zhao Author-Name: William Biscarri Author-X-Name-First: William Author-X-Name-Last: Biscarri Title: A Regression Modeling Approach to Structured Shrinkage Estimation Abstract: Problems involving the simultaneous estimation of multiple parameters arise in many areas of theoretical and applied statistics. A canonical example is the estimation of a vector of normal means. Frequently, structural information about relationships between the parameters of interest is available. For example, in a gene expression denoising problem, genes with similar functions may have similar expression levels. Despite its importance, structural information has not been well-studied in the simultaneous estimation literature, perhaps in part because it poses challenges to the usual geometric or empirical Bayes shrinkage estimation paradigms. This article proposes that some of these challenges can be resolved by adopting an alternate paradigm, based on regression modeling. This approach can naturally incorporate structural information and also motivates new shrinkage estimation and inference procedures. As an illustration, this regression paradigm is used to develop a class of estimators with asymptotic risk optimality properties that perform well in simulations and in denoising gene expression data from a single cell RNA-sequencing experiment. Journal: Journal of the American Statistical Association Pages: 1684-1694 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1875838 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1875838 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1684-1694 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1892703_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Davy Paindaveine Author-X-Name-First: Davy Author-X-Name-Last: Paindaveine Author-Name: Joséa Rasoafaraniaina Author-X-Name-First: Joséa Author-X-Name-Last: Rasoafaraniaina Author-Name: Thomas Verdebout Author-X-Name-First: Thomas Author-X-Name-Last: Verdebout Title: Preliminary Multiple-Test Estimation, With Applications to k-Sample Covariance Estimation Abstract: Multisample covariance estimation—that is, estimation of the covariance matrices associated with k distinct populations—is a classical problem in multivariate statistics. A common solution is to base estimation on the outcome of a test that these covariance matrices show some given pattern. 
Such a preliminary test may, for example, investigate whether or not the various covariance matrices are equal to each other (test of homogeneity), or whether or not they have common eigenvectors (test of common principal components), etc. Since it is usually unclear what the possible pattern might be, it is natural to consider a collection of such patterns, leading to a collection of preliminary tests, and to base estimation on the outcome of such a multiple testing rule. In the present work, we therefore study preliminary test estimation based on multiple tests. Since this is of interest also outside k-sample covariance estimation, we do so in a very general framework where it is only assumed that the sequence of models at hand is locally asymptotically normal. In this general setup, we define the proposed estimators and derive their asymptotic properties. We come back to k-sample covariance estimation to illustrate the asymptotic and finite-sample behaviors of our estimators. Finally, we treat a real data example that allows us to show their practical relevance in a supervised classification framework. Journal: Journal of the American Statistical Association Pages: 1904-1915 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1892703 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1892703 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1904-1915 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1884562_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Yangfan Zhang Author-X-Name-First: Yangfan Author-X-Name-Last: Zhang Author-Name: Runmin Wang Author-X-Name-First: Runmin Author-X-Name-Last: Wang Author-Name: Xiaofeng Shao Author-X-Name-First: Xiaofeng Author-X-Name-Last: Shao Title: Adaptive Inference for Change Points in High-Dimensional Data Abstract: In this article, we propose a class of test statistics for a change point in the mean of high-dimensional independent data. Our test integrates the U-statistic based approach in a recent work by Wang et al. and the Lq-norm based high-dimensional test in a recent work by He et al., and inherits several appealing features such as being tuning parameter free and asymptotic independence for test statistics corresponding to even q’s. A simple combination of test statistics corresponding to several different q’s leads to a test with adaptive power property, that is, it can be powerful against both sparse and dense alternatives. On the estimation front, we obtain the convergence rate of the maximizer of our test statistic standardized by sample size when there is one change-point in mean and q = 2, and propose to combine our tests with a wild binary segmentation algorithm to estimate the change-point number and locations when there are multiple change-points. Numerical comparisons using both simulated and real data demonstrate the advantage of our adaptive test and its corresponding estimation method. Journal: Journal of the American Statistical Association Pages: 1751-1762 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1884562 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1884562 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
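To make the adaptivity idea in the preceding record concrete, here is a minimal Python sketch. It is not the authors' U-statistic construction: it simply scans candidate split points with both a dense (L2) and a sparse (sup-norm) CUSUM statistic and combines the two permutation p-values, so that power is retained against both dense and sparse mean shifts. All function names are hypothetical and the permutation null assumes independent observations.

    import numpy as np

    rng = np.random.default_rng(0)

    def scan_stats(X):
        """X: (n, p) high-dimensional sequence with n > 20. For each candidate
        split t, form the mean-difference vector and record a dense (L2) and
        a sparse (sup-norm) scan statistic; return the max over t of each."""
        n, p = X.shape
        dense, sparse = 0.0, 0.0
        for t in range(10, n - 10):            # trim the boundaries
            d = X[:t].mean(0) - X[t:].mean(0)
            w = np.sqrt(t * (n - t) / n)       # CUSUM weight
            dense = max(dense, w**2 * (d**2).sum())
            sparse = max(sparse, w * np.abs(d).max())
        return dense, sparse

    def adaptive_pvalue(X, B=200):
        """Combine the two scans by taking the minimum of their permutation
        p-values with a Bonferroni correction, so the resulting test adapts
        to both dense and sparse mean shifts."""
        obs = scan_stats(X)
        null = np.array([scan_stats(rng.permutation(X)) for _ in range(B)])
        pvals = [(1 + (null[:, k] >= obs[k]).sum()) / (B + 1) for k in range(2)]
        return min(1.0, 2 * min(pvals))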
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1751-1762 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1891925_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Xi Chen Author-X-Name-First: Xi Author-X-Name-Last: Chen Author-Name: Weidong Liu Author-X-Name-First: Weidong Author-X-Name-Last: Liu Author-Name: Yichen Zhang Author-X-Name-First: Yichen Author-X-Name-Last: Zhang Title: First-Order Newton-Type Estimator for Distributed Estimation and Inference Abstract: This article studies distributed estimation and inference for a general statistical problem with a convex loss that could be nondifferentiable. For the purpose of efficient computation, we restrict ourselves to stochastic first-order optimization, which enjoys low per-iteration complexity. To motivate the proposed method, we first investigate the theoretical properties of a straightforward divide-and-conquer stochastic gradient descent approach. Our theory shows that there is a restriction on the number of machines and this restriction becomes more stringent when the dimension p is large. To overcome this limitation, this article proposes a new multi-round distributed estimation procedure that approximates the Newton step using only stochastic subgradients. The key component in our method is the proposal of a computationally efficient estimator of Σ−1w, where Σ is the population Hessian matrix and w is any given vector. Instead of estimating Σ (or Σ−1), which usually requires the second-order differentiability of the loss, the proposed first-order Newton-type estimator (FONE) directly estimates the vector of interest Σ−1w as a whole and is applicable to nondifferentiable losses. Our estimator also facilitates inference for the empirical risk minimizer. It turns out that the key term in the limiting covariance has the form of Σ−1w, which can be estimated by FONE. Journal: Journal of the American Statistical Association Pages: 1858-1874 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1891925 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1891925 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1858-1874 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1895176_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Stéphane Guerrier Author-X-Name-First: Stéphane Author-X-Name-Last: Guerrier Author-Name: Roberto Molinari Author-X-Name-First: Roberto Author-X-Name-Last: Molinari Author-Name: Maria-Pia Victoria-Feser Author-X-Name-First: Maria-Pia Author-X-Name-Last: Victoria-Feser Author-Name: Haotian Xu Author-X-Name-First: Haotian Author-X-Name-Last: Xu Title: Robust Two-Step Wavelet-Based Inference for Time Series Models Abstract: Latent time series models such as (the independent sum of) ARMA(p, q) models with additional stochastic processes are increasingly used for data analysis in biology, ecology, engineering, and economics. Inference on and/or prediction from these models can be highly challenging: (i) the data may contain outliers that can adversely affect the estimation procedure; (ii) the computational complexity can become prohibitive when the time series are extremely large; (iii) model selection adds another layer of (computational) complexity; and (iv) solutions that address (i), (ii), and (iii) simultaneously do not exist in practice.
This paper aims at jointly addressing these challenges by proposing a general framework for robust two-step estimation based on a bounded influence M-estimator of the wavelet variance. We first develop the conditions for the joint asymptotic normality of the latter estimator thereby providing the necessary tools to perform (direct) inference for scale-based analysis of signals. Taking advantage of the model-independent weights of this first-step estimator, we then develop the asymptotic properties of two-step robust estimators using the framework of the generalized method of wavelet moments (GMWM). Simulation studies illustrate the good finite sample performance of the robust GMWM estimator and applied examples highlight the practical relevance of the proposed approach. Journal: Journal of the American Statistical Association Pages: 1996-2013 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1895176 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1895176 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1996-2013 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2139707_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Zixiao Wang Author-X-Name-First: Zixiao Author-X-Name-Last: Wang Author-Name: Yi Feng Author-X-Name-First: Yi Author-X-Name-Last: Feng Author-Name: Lin Liu Author-X-Name-First: Lin Author-X-Name-Last: Liu Title: Semiparametric Regression with R Journal: Journal of the American Statistical Association Pages: 2283-2287 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2022.2139707 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2139707 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2283-2287 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1893178_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949 Author-Name: Likai Chen Author-X-Name-First: Likai Author-X-Name-Last: Chen Author-Name: Weining Wang Author-X-Name-First: Weining Author-X-Name-Last: Wang Author-Name: Wei Biao Wu Author-X-Name-First: Wei Biao Author-X-Name-Last: Wu Title: Inference of Breakpoints in High-dimensional Time Series Abstract: For multiple change-points detection of high-dimensional time series, we provide asymptotic theory concerning the consistency and the asymptotic distribution of the breakpoint statistics and estimated break sizes. The theory backs up a simple two-step procedure for detecting and estimating multiple change-points. The proposed two-step procedure involves the maximum of a MOSUM (moving sum) type statistics in the first step and a CUSUM (cumulative sum) refinement step on an aggregated time series in the second step. Thus, for a fixed time-point, we can capture both the biggest break across different coordinates and aggregating simultaneous breaks over multiple coordinates. Extending the existing high-dimensional Gaussian approximation theorem to dependent data with jumps, the theory allows us to characterize the size and power of our multiple change-point test asymptotically. Moreover, we can make inferences on the breakpoints estimates when the break sizes are small. Our theoretical setup incorporates both weak temporal and strong or weak cross-sectional dependence and is suitable for heavy-tailed innovations. 
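A minimal Python sketch of a first-step MOSUM scan of the kind just described, assuming a fully observed (n x p) panel; thresholding the peaks of the returned curve gives candidate breakpoints, which the second-step CUSUM refinement on an aggregated series would then localize. The scaling below is schematic and ignores the long-run variance normalization treated in the article.

    import numpy as np

    def mosum_max(X, h):
        """X: (n, p) time series, h: window half-width. Returns, for each
        time point, the largest MOSUM (moving-sum) contrast across the p
        coordinates, i.e. the sup-norm of the difference between the sums
        over the windows [t-h, t) and [t, t+h)."""
        n, p = X.shape
        S = np.vstack([np.zeros(p), X.cumsum(axis=0)])  # prefix sums
        stats = np.zeros(n)
        for t in range(h, n - h):
            left = S[t] - S[t - h]        # sum over the left window
            right = S[t + h] - S[t]       # sum over the right window
            stats[t] = np.abs(right - left).max() / np.sqrt(2 * h)
        return stats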
A robust long-run covariance matrix estimator is proposed, which can be of independent interest. An application to detecting structural changes in the U.S. unemployment rate is considered to illustrate the usefulness of our method. Journal: Journal of the American Statistical Association Pages: 1951-1963 Issue: 540 Volume: 117 Year: 2022 Month: 10 X-DOI: 10.1080/01621459.2021.1893178 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1893178 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1951-1963 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1917418_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Xiao Han Author-X-Name-First: Xiao Author-X-Name-Last: Han Author-Name: Xin Tong Author-X-Name-First: Xin Author-X-Name-Last: Tong Author-Name: Yingying Fan Author-X-Name-First: Yingying Author-X-Name-Last: Fan Title: Eigen Selection in Spectral Clustering: A Theory-Guided Practice Abstract: Based on a Gaussian mixture type model of K components, we derive eigen selection procedures that improve the usual spectral clustering algorithms in high-dimensional settings, which typically act on the top few eigenvectors of an affinity matrix (e.g., X⊤X) derived from the data matrix X. Our selection principle formalizes two intuitions: (i) eigenvectors should be dropped when they have no clustering power; (ii) some eigenvectors corresponding to smaller spiked eigenvalues should be dropped due to estimation inaccuracy. Our selection procedures lead to new spectral clustering algorithms: ESSC for K = 2 and GESSC for K > 2. The newly proposed algorithms enjoy better stability and compare favorably against canonical alternatives, as demonstrated in extensive simulation and multiple real data studies. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 109-121 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1917418 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1917418 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:109-121 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2115916_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Moo K. Chung Author-X-Name-First: Moo K. Author-X-Name-Last: Chung Author-Name: Jamie L. Hanson Author-X-Name-First: Jamie L. Author-X-Name-Last: Hanson Author-Name: Richard J. Davidson Author-X-Name-First: Richard J. Author-X-Name-Last: Davidson Author-Name: Seth D. Pollak Author-X-Name-First: Seth D. Author-X-Name-Last: Pollak Title: Discussion of “LESA: Longitudinal Elastic Shape Analysis of Brain Subcortical Structures” Journal: Journal of the American Statistical Association Pages: 20-21 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2022.2115916 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2115916 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:20-21 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1953507_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Hao Chen Author-X-Name-First: Hao Author-X-Name-Last: Chen Author-Name: Yin Xia Author-X-Name-First: Yin Author-X-Name-Last: Xia Title: A Normality Test for High-dimensional Data Based on the Nearest Neighbor Approach Abstract: Many statistical methodologies for high-dimensional data assume the population is normal. Although a few multivariate normality tests have been proposed, to the best of our knowledge, none of them can properly control the Type I error when the dimension is larger than the number of observations. In this work, we propose a novel nonparametric test that uses the nearest neighbor information. The proposed method guarantees asymptotic Type I error control in the high-dimensional setting. Simulation studies verify the empirical size performance of the proposed test when the dimension grows with the sample size and at the same time exhibit the superior power of the new test compared with alternative methods. We also illustrate our approach through two widely used datasets from the high-dimensional classification and clustering literature, where deviation from the normality assumption may lead to invalid conclusions. Journal: Journal of the American Statistical Association Pages: 719-731 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1953507 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1953507 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:719-731 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1942012_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Jingyu He Author-X-Name-First: Jingyu Author-X-Name-Last: He Author-Name: P. Richard Hahn Author-X-Name-First: P. Richard Author-X-Name-Last: Hahn Title: Stochastic Tree Ensembles for Regularized Nonlinear Regression Abstract: This article develops a novel stochastic tree ensemble method for nonlinear regression, referred to as accelerated Bayesian additive regression trees, or XBART. By combining regularization and stochastic search strategies from Bayesian modeling with computationally efficient techniques from recursive partitioning algorithms, XBART attains state-of-the-art performance at prediction and function estimation. Simulation studies demonstrate that XBART provides accurate point-wise estimates of the mean function and does so faster than popular alternatives, such as BART, XGBoost, and neural networks (using Keras) on a variety of test functions. Additionally, it is demonstrated that using XBART to initialize the standard BART MCMC algorithm considerably improves credible interval coverage and reduces total run-time. Finally, two basic theoretical results are established: the single tree version of the model is asymptotically consistent and the Markov chain produced by the ensemble version of the algorithm has a unique stationary distribution. Journal: Journal of the American Statistical Association Pages: 551-570 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1942012 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1942012 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:551-570 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2173603_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: The Editors Title: Correction to “Modeling Time-Varying Random Objects and Dynamic Networks” Journal: Journal of the American Statistical Association Pages: 778-778 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2023.2173603 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2173603 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:778-778 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1930547_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Wanchuang Zhu Author-X-Name-First: Wanchuang Author-X-Name-Last: Zhu Author-Name: Yingkai Jiang Author-X-Name-First: Yingkai Author-X-Name-Last: Jiang Author-Name: Jun S. Liu Author-X-Name-First: Jun S. Author-X-Name-Last: Liu Author-Name: Ke Deng Author-X-Name-First: Ke Author-X-Name-Last: Deng Title: Partition–Mallows Model and Its Inference for Rank Aggregation Abstract: Learning how to aggregate ranking lists has been an active research area for many years and its advances have played a vital role in many applications ranging from bioinformatics to internet commerce. The problem of discerning reliability of rankers based only on the rank data is of great interest to many practitioners, but has received less attention from researchers. By dividing the ranked entities into two disjoint groups, that is, relevant and irrelevant/background ones, and incorporating the Mallows model for the relative ranking of relevant entities, we propose a framework for rank aggregation that can not only distinguish quality differences among the rankers but also provide the detailed ranking information for relevant entities. Theoretical properties of the proposed approach are established, and its advantages over existing approaches are demonstrated via simulation studies and real-data applications. Extensions of the proposed method to handle partial ranking lists and conduct covariate-assisted rank aggregation are also discussed. Journal: Journal of the American Statistical Association Pages: 343-359 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1930547 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1930547 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:343-359 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1920958_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Eugene Katsevich Author-X-Name-First: Eugene Author-X-Name-Last: Katsevich Author-Name: Chiara Sabatti Author-X-Name-First: Chiara Author-X-Name-Last: Sabatti Author-Name: Marina Bogomolov Author-X-Name-First: Marina Author-X-Name-Last: Bogomolov Title: Filtering the Rejection Set While Preserving False Discovery Rate Control Abstract: Scientific hypotheses in a variety of applications have domain-specific structures, such as the tree structure of the international classification of diseases (ICD), the directed acyclic graph structure of the gene ontology (GO), or the spatial structure in genome-wide association studies. 
In the context of multiple testing, the resulting relationships among hypotheses can create redundancies among rejections that hinder interpretability. This leads to the practice of filtering rejection sets obtained from multiple testing procedures, which may in turn invalidate their inferential guarantees. We propose Focused BH, a simple, flexible, and principled methodology to adjust for the application of any prespecified filter. We prove that Focused BH controls the false discovery rate under various conditions, including when the filter satisfies an intuitive monotonicity property and the p-values are positively dependent. We demonstrate in simulations that Focused BH performs well across a variety of settings, and illustrate this method’s practical utility via analyses of real datasets based on ICD and GO. Journal: Journal of the American Statistical Association Pages: 165-176 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1920958 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1920958 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:165-176 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2173458_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Michael L. Stein Author-X-Name-First: Michael L. Author-X-Name-Last: Stein Title: Editorial: What Makes for a Great Applications and Case Studies Paper? Journal: Journal of the American Statistical Association Pages: 1-2 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2023.2173458 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2173458 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:1-2 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1927741_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Arun K. Kuchibhotla Author-X-Name-First: Arun K. Author-X-Name-Last: Kuchibhotla Author-Name: Rohit K. Patra Author-X-Name-First: Rohit K. Author-X-Name-Last: Patra Author-Name: Bodhisattva Sen Author-X-Name-First: Bodhisattva Author-X-Name-Last: Sen Title: Semiparametric Efficiency in Convexity Constrained Single-Index Model Abstract: We consider estimation and inference in a single-index regression model with an unknown convex link function. We introduce a convex and Lipschitz constrained least-squares estimator (CLSE) for both the parametric and the nonparametric components given independent and identically distributed observations. We prove the consistency and find the rates of convergence of the CLSE when the errors are assumed to have only q ≥ 2 moments and are allowed to depend on the covariates. When q ≥ 5, we establish the n−1/2-rate of convergence and asymptotic normality of the estimator of the parametric component. Moreover, the CLSE is proved to be semiparametrically efficient if the errors happen to be homoscedastic. We develop and implement a numerically stable and computationally fast algorithm to compute our proposed estimator in the R package simest. We illustrate our methodology through extensive simulations and data analysis. Finally, our proof of efficiency is geometric and provides a general framework that can be used to prove efficiency of estimators in a wide variety of semiparametric models even when they do not satisfy the efficient score equation directly.
Supplementary files for this article are available online. Journal: Journal of the American Statistical Association Pages: 272-286 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1927741 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1927741 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:272-286 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1955688_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Bowen Gang Author-X-Name-First: Bowen Author-X-Name-Last: Gang Author-Name: Wenguang Sun Author-X-Name-First: Wenguang Author-X-Name-Last: Sun Author-Name: Weinan Wang Author-X-Name-First: Weinan Author-X-Name-Last: Wang Title: Structure–Adaptive Sequential Testing for Online False Discovery Rate Control Abstract: Consider the online testing of a stream of hypotheses where a real-time decision must be made before the next data point arrives. The error rate is required to be controlled at all decision points. Conventional simultaneous testing rules are no longer applicable due to the more stringent error constraints and absence of future data. Moreover, the online decision-making process may come to a halt when the total error budget, or alpha-wealth, is exhausted. This work develops a new class of structure-adaptive sequential testing (SAST) rules for online false discovery rate (FDR) control. A key element in our proposal is a new alpha-investing algorithm that precisely characterizes the gains and losses in sequential decision making. SAST captures time-varying structures of the data stream, learns the optimal threshold adaptively in an ongoing manner and optimizes the alpha-wealth allocation across different time periods. We present theory and numerical results to show that SAST is asymptotically valid for online FDR control and achieves substantial power gain over existing online testing rules. Journal: Journal of the American Statistical Association Pages: 732-745 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1955688 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1955688 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:732-745 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1938083_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Yujia Deng Author-X-Name-First: Yujia Author-X-Name-Last: Deng Author-Name: Xiwei Tang Author-X-Name-First: Xiwei Author-X-Name-Last: Tang Author-Name: Annie Qu Author-X-Name-First: Annie Author-X-Name-Last: Qu Title: Correlation Tensor Decomposition and Its Application in Spatial Imaging Data Abstract: Multi-dimensional tensor data have gained increasing attention in recent years, especially in biomedical imaging analyses. However, most existing tensor models are based only on the mean information of imaging pixels. Motivated by multimodal optical imaging data in a breast cancer study, we develop a new tensor learning approach to use pixel-wise correlation information, which is represented through the higher order correlation tensor. We propose a novel semi-symmetric correlation tensor decomposition method that effectively captures the informative spatial patterns of pixel-wise correlations to facilitate cancer diagnosis.
We establish the theoretical properties for recovering structure and for classification consistency. In addition, we develop an efficient algorithm to achieve computational scalability. Our simulation studies and an application on breast cancer imaging data all indicate that the proposed method outperforms other competing methods in terms of pattern recognition and prediction accuracy. Journal: Journal of the American Statistical Association Pages: 440-456 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1938083 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1938083 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:440-456 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1938582_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Alexander Henzi Author-X-Name-First: Alexander Author-X-Name-Last: Henzi Author-Name: Gian-Reto Kleger Author-X-Name-First: Gian-Reto Author-X-Name-Last: Kleger Author-Name: Johanna F. Ziegel Author-X-Name-First: Johanna F. Author-X-Name-Last: Ziegel Title: Distributional (Single) Index Models Abstract: A Distributional (Single) Index Model (DIM) is a semiparametric model for distributional regression, that is, estimation of conditional distributions given covariates. The method is a combination of classical single-index models for the estimation of the conditional mean of a response given covariates, and isotonic distributional regression. The model for the index is parametric, whereas the conditional distributions are estimated nonparametrically under a stochastic ordering constraint. We show consistency of our estimators and apply them to a highly challenging dataset on the length of stay (LoS) of patients in intensive care units. We use the model to provide skillful and calibrated probabilistic predictions for the LoS of individual patients, which outperform the available methods in the literature. Journal: Journal of the American Statistical Association Pages: 489-503 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1938582 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1938582 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:489-503 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2120399_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: John A. D. Aston Author-X-Name-First: John A. D. Author-X-Name-Last: Aston Author-Name: Eardi Lila Author-X-Name-First: Eardi Author-X-Name-Last: Lila Title: Discussion of LESA: Longitudinal Elastic Shape Analysis of Brain Subcortical Structures Journal: Journal of the American Statistical Association Pages: 18-19 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2022.2120399 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2120399 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
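For the distributional index model record above, the second stage can be sketched in a few lines of Python: given a fitted index, isotonic regression of threshold indicators on the index yields conditional CDF estimates that respect the stochastic ordering constraint. The sketch assumes the index has already been estimated and that a larger index means a stochastically larger response; the function name and grid are hypothetical.

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    def dim_cdf(index, y, grid):
        """Estimate the conditional CDF F(z | index) at each threshold z in
        `grid` by isotonic regression of the indicators 1{y <= z} on the
        index, imposing that F is decreasing in the index (stochastic
        ordering).  Returns a (len(grid), n) array of fitted CDF values."""
        F = np.empty((len(grid), len(y)))
        for k, z in enumerate(grid):
            iso = IsotonicRegression(y_min=0.0, y_max=1.0, increasing=False)
            F[k] = iso.fit_transform(index, (y <= z).astype(float))
        return F

Because the indicators 1{y <= z} increase pointwise in z, the fitted CDFs produced threshold-by-threshold are automatically monotone in z as well, so the output is a valid family of distribution functions.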
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:18-19 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2102984_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Zhengwu Zhang Author-X-Name-First: Zhengwu Author-X-Name-Last: Zhang Author-Name: Yuexuan Wu Author-X-Name-First: Yuexuan Author-X-Name-Last: Wu Author-Name: Di Xiong Author-X-Name-First: Di Author-X-Name-Last: Xiong Author-Name: Joseph G. Ibrahim Author-X-Name-First: Joseph G. Author-X-Name-Last: Ibrahim Author-Name: Anuj Srivastava Author-X-Name-First: Anuj Author-X-Name-Last: Srivastava Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Title: LESA: Longitudinal Elastic Shape Analysis of Brain Subcortical Structures Abstract: Over the past 30 years, magnetic resonance imaging has become a ubiquitous tool for accurately visualizing the change and development of the brain’s subcortical structures (e.g., hippocampus). Although subcortical structures act as information hubs of the nervous system, their quantification is still in its infancy due to many challenges in shape extraction, representation, and modeling. Here, we develop a simple and efficient framework of longitudinal elastic shape analysis (LESA) for subcortical structures. Integrating ideas from elastic shape analysis of static surfaces and statistical modeling of sparse longitudinal data, LESA provides a set of tools for systematically quantifying changes of longitudinal subcortical surface shapes from raw structural MRI data. The key novelties of LESA include: (i) it can efficiently represent complex subcortical structures using a small number of basis functions and (ii) it can accurately delineate the spatiotemporal shape changes of the human subcortical structures. We applied LESA to analyze three longitudinal neuroimaging datasets and showcased its wide applications in estimating continuous shape trajectories, building life-span growth patterns, and comparing shape differences among different groups. In particular, with the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data, we found that Alzheimer’s Disease (AD) can significantly speed up the shape change of the lateral ventricle and the hippocampus between 60 and 75 years of age compared with normal aging. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 3-17 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2022.2102984 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2102984 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:3-17 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1930546_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Jue Hou Author-X-Name-First: Jue Author-X-Name-Last: Hou Author-Name: Jelena Bradic Author-X-Name-First: Jelena Author-X-Name-Last: Bradic Author-Name: Ronghui Xu Author-X-Name-First: Ronghui Author-X-Name-Last: Xu Title: Treatment Effect Estimation Under Additive Hazards Models With High-Dimensional Confounding Abstract: Estimating treatment effects for survival outcomes in the high-dimensional setting is critical for many biomedical applications and any application with censored observations.
This article establishes an “orthogonal” score for learning treatment effects, using observational data with a potentially large number of confounders. The estimator yields asymptotically valid root-n confidence intervals, despite the bias induced by the regularization. Moreover, we develop a novel hazard difference (HDi) estimator. We establish rate double robustness through the cross-fitting formulation. Numerical experiments illustrate the finite sample performance, where we observe that the cross-fitted HDi estimator has the best performance. We study the effect of radical prostatectomy compared with conservative prostate cancer management through the SEER-Medicare linked data. Last, we provide an extension of both approaches to machine learning methods and to heterogeneous treatment effects. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 327-342 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1930546 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1930546 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:327-342 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1941053_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Kevin Guo Author-X-Name-First: Kevin Author-X-Name-Last: Guo Author-Name: Guillaume Basse Author-X-Name-First: Guillaume Author-X-Name-Last: Basse Title: The Generalized Oaxaca-Blinder Estimator Abstract: After performing a randomized experiment, researchers often use ordinary least-squares (OLS) regression to adjust for baseline covariates when estimating the average treatment effect. It is widely known that the resulting confidence interval is valid even if the linear model is misspecified. In this article, we generalize that conclusion to covariate adjustment with nonlinear models. We introduce an intuitive way to use any “simple” nonlinear model to construct a covariate-adjusted confidence interval for the average treatment effect. The confidence interval derives its validity from randomization alone, and when nonlinear models fit the data better than linear models, it is narrower than the usual interval from OLS adjustment. Journal: Journal of the American Statistical Association Pages: 524-536 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1941053 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1941053 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:524-536 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1919122_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Jie Chen Author-X-Name-First: Jie Author-X-Name-Last: Chen Author-Name: Michael L. Stein Author-X-Name-First: Michael L. Author-X-Name-Last: Stein Title: Linear-Cost Covariance Functions for Gaussian Random Fields Abstract: Gaussian random fields (GRF) are a fundamental stochastic model for spatiotemporal data analysis. An essential ingredient of GRF is the covariance function that characterizes the joint Gaussian distribution of the field. Commonly used covariance functions give rise to fully dense and unstructured covariance matrices, for which required calculations are notoriously expensive to carry out for large data.
In this work, we propose a construction of covariance functions that result in matrices with a hierarchical structure. Empowered by matrix algorithms that scale linearly with the matrix dimension, the hierarchical structure is proved to be efficient for a variety of random field computations, including sampling, kriging, and likelihood evaluation. Specifically, with n scattered sites, sampling and likelihood evaluation have an O(n) cost and kriging has an O(log n) cost after preprocessing, particularly favorable for the kriging of an extremely large number of sites (e.g., predicting on more sites than observed). We present comprehensive numerical experiments to demonstrate the use of the constructed covariance functions and their appealing computation time. Numerical examples on a laptop include simulated data of size up to one million, as well as a climate data product with over two million observations. Journal: Journal of the American Statistical Association Pages: 147-164 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1919122 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1919122 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:147-164 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1935268_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Cheng-Yu Sun Author-X-Name-First: Cheng-Yu Author-X-Name-Last: Sun Author-Name: Boxin Tang Author-X-Name-First: Boxin Author-X-Name-Last: Tang Title: Uniform Projection Designs and Strong Orthogonal Arrays Abstract: We explore the connections between uniform projection designs and strong orthogonal arrays of strength 2+ in this article. Both of these classes of designs are suitable for computer experiments and space-filling in two-dimensional margins, but they are motivated by different considerations. Uniform projection designs are introduced by Sun, Wang, and Xu to capture two-dimensional uniformity using the centered L2-discrepancy, whereas strong orthogonal arrays of strength 2+ are brought forth by He, Cheng, and Tang as they achieve stratifications in two dimensions on finer grids than ordinary orthogonal arrays. We first derive a new expression for the centered L2-discrepancy, which gives a decomposition of the criterion into a sum of squares where each square measures one aspect of design uniformity. This result is not only insightful in itself but also allows us to study strong orthogonal arrays in terms of the discrepancy criterion. More specifically, we show that strong orthogonal arrays of strength 2+ are optimal or nearly optimal under the uniform projection criterion. Journal: Journal of the American Statistical Association Pages: 417-423 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1935268 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1935268 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
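As a concrete companion to the Sun and Tang record above, the following Python sketch evaluates the centered L2-discrepancy of a design through its well-known closed form (stated here from memory, so worth checking against a standard reference before relying on it); the uniform projection criterion then averages such discrepancies over all two-dimensional projections of the design.

    import numpy as np

    def centered_l2_discrepancy(X):
        """Centered L2-discrepancy of an n x d design X with points scaled
        to the unit cube [0, 1]^d, via the closed-form expression with
        three terms: a constant, a single sum, and a double sum."""
        n, d = X.shape
        a = np.abs(X - 0.5)
        term1 = (13.0 / 12.0) ** d
        term2 = (2.0 / n) * np.prod(1 + 0.5 * a - 0.5 * a**2, axis=1).sum()
        diff = np.abs(X[:, None, :] - X[None, :, :])
        prod = np.prod(1 + 0.5 * a[:, None, :] + 0.5 * a[None, :, :]
                       - 0.5 * diff, axis=2)
        term3 = prod.sum() / n**2
        return np.sqrt(term1 - term2 + term3)

    # Example: discrepancy of a random 27-run design in three factors.
    print(centered_l2_discrepancy(np.random.rand(27, 3)))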
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:417-423 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2163898_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Sabrina Giordano Author-X-Name-First: Sabrina Author-X-Name-Last: Giordano Title: Data Science Ethics: Concepts, Techniques and Cautionary Tales Journal: Journal of the American Statistical Association Pages: 774-776 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2022.2163898 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2163898 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:774-776 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1955689_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Yuqi Gu Author-X-Name-First: Yuqi Author-X-Name-Last: Gu Author-Name: Gongjun Xu Author-X-Name-First: Gongjun Author-X-Name-Last: Xu Title: A Joint MLE Approach to Large-Scale Structured Latent Attribute Analysis Abstract: Structured latent attribute models (SLAMs) are a family of discrete latent variable models widely used in education, psychology, and epidemiology to model multivariate categorical data. A SLAM assumes that multiple discrete latent attributes explain the dependence of observed variables in a highly structured fashion. Usually, the maximum marginal likelihood estimation approach is adopted for SLAMs, treating the latent attributes as random effects. The increasing scope of modern assessment data involves large numbers of observed variables and high-dimensional latent attributes. This poses challenges to classical estimation methods and requires new methodology and understanding of latent variable modeling. Motivated by this, we consider the joint maximum likelihood estimation (MLE) approach to SLAMs, treating latent attributes as fixed unknown parameters. We investigate estimability, consistency, and computation in the regime where sample size, number of variables, and number of latent attributes all can diverge. We establish the statistical consistency of the joint MLE and propose efficient algorithms that scale well to large-scale data for several popular SLAMs. Simulation studies demonstrate the superior empirical performance of the proposed methods. An application to real data from an international educational assessment gives interpretable findings of cognitive diagnosis. Journal: Journal of the American Statistical Association Pages: 746-760 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1955689 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1955689 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:746-760 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1923508_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Nabarun Deb Author-X-Name-First: Nabarun Author-X-Name-Last: Deb Author-Name: Bodhisattva Sen Author-X-Name-First: Bodhisattva Author-X-Name-Last: Sen Title: Multivariate Rank-Based Distribution-Free Nonparametric Testing Using Measure Transportation Abstract: In this article, we propose a general framework for distribution-free nonparametric testing in multi-dimensions, based on a notion of multivariate ranks defined using the theory of measure transportation. 
Unlike other existing proposals in the literature, these multivariate ranks share a number of useful properties with the usual one-dimensional ranks; most importantly, these ranks are distribution-free. This crucial observation allows us to design nonparametric tests that are exactly distribution-free under the null hypothesis. We demonstrate the applicability of this approach by constructing exact distribution-free tests for two classical nonparametric problems: (I) testing for mutual independence between random vectors, and (II) testing for the equality of multivariate distributions. In particular, we propose (multivariate) rank versions of distance covariance and energy statistic for testing scenarios (I) and (II), respectively. In both these problems, we derive the asymptotic null distribution of the proposed test statistics. We further show that our tests are consistent against all fixed alternatives. Moreover, the proposed tests are computationally feasible and are well-defined under minimal assumptions on the underlying distributions (e.g., they do not need any moment assumptions). We also demonstrate the efficacy of these procedures via extensive simulations. In the process of analyzing the theoretical properties of our procedures, we end up proving some new results in the theory of measure transportation and in the limit theory of permutation statistics using Stein’s method for exchangeable pairs, which may be of independent interest. Journal: Journal of the American Statistical Association Pages: 192-207 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1923508 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1923508 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:192-207 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1920959_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Zhenhua Lin Author-X-Name-First: Zhenhua Author-X-Name-Last: Lin Author-Name: Miles E. Lopes Author-X-Name-First: Miles E. Author-X-Name-Last: Lopes Author-Name: Hans-Georg Müller Author-X-Name-First: Hans-Georg Author-X-Name-Last: Müller Title: High-Dimensional MANOVA Via Bootstrapping and Its Application to Functional and Sparse Count Data Abstract: We propose a new approach to the problem of high-dimensional multivariate ANOVA via bootstrapping max statistics that involve the differences of sample mean vectors. The proposed method proceeds via the construction of simultaneous confidence regions for the differences of population mean vectors. It is suited to simultaneously test the equality of several pairs of mean vectors of potentially more than two populations. By exploiting the variance decay property that is a natural feature in relevant applications, we are able to provide dimension-free and nearly parametric convergence rates for Gaussian approximation, bootstrap approximation, and the size of the test. We demonstrate the proposed approach with ANOVA problems for functional data and sparse count data. The proposed methodology is shown to work well in simulations and several real data applications. Journal: Journal of the American Statistical Association Pages: 177-191 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1920959 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1920959 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
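A two-sample toy version of the bootstrapped max statistic described in the record above, assuming coordinates on comparable scales; the article treats several populations, simultaneous confidence regions, and the variance decay property, all of which are omitted here, and the function name is hypothetical.

    import numpy as np

    rng = np.random.default_rng(1)

    def max_stat_test(X, Y, B=1000):
        """Reject equality of the two mean vectors if the sup-norm of the
        sample mean difference exceeds the Gaussian-multiplier bootstrap
        quantile of its centered counterpart.  Returns the statistic and
        a bootstrap p-value."""
        n, m = len(X), len(Y)
        T = np.abs(X.mean(0) - Y.mean(0)).max()
        Xc, Yc = X - X.mean(0), Y - Y.mean(0)
        boot = np.empty(B)
        for b in range(B):
            e1 = rng.standard_normal(n)[:, None]  # multipliers, sample 1
            e2 = rng.standard_normal(m)[:, None]  # multipliers, sample 2
            boot[b] = np.abs((e1 * Xc).mean(0) - (e2 * Yc).mean(0)).max()
        return T, (boot >= T).mean()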
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:177-191 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1947306_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Chencheng Cai Author-X-Name-First: Chencheng Author-X-Name-Last: Cai Author-Name: Rong Chen Author-X-Name-First: Rong Author-X-Name-Last: Chen Author-Name: Min-ge Xie Author-X-Name-First: Min-ge Author-X-Name-Last: Xie Title: Individualized Group Learning Abstract: Many massive data sets are assembled by collecting information on a large number of individuals in a population. The analysis of such data, especially in the aspect of individualized inferences and solutions, has the potential to create significant value for practical applications. Traditionally, inference for an individual in the dataset relies either solely on that individual’s own information or on summaries of the whole population. However, with the availability of big data, we have the opportunity, as well as a unique challenge, to make a more effective individualized inference that takes into consideration both the population information and the individual discrepancy. To deal with the possible heterogeneity within the population while providing effective and credible inferences for individuals in a dataset, this article develops a new approach called individualized group learning (iGroup). The iGroup approach uses local nonparametric techniques to generate an individualized group by pooling other entities in the population that share similar characteristics with the target individual, even when individual estimates are biased due to a limited number of observations. Three general cases of iGroup are discussed, and their asymptotic performances are investigated. Both theoretical results and empirical simulations reveal that, by applying iGroup, the performance of statistical inference at the individual level is ensured and can be substantially improved over inference based solely on either individual information or entire-population information. The method has a broad range of applications. An example in financial statistics is presented. Journal: Journal of the American Statistical Association Pages: 622-638 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1947306 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1947306 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:622-638 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1918130_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Tao Zhang Author-X-Name-First: Tao Author-X-Name-Last: Zhang Author-Name: Kengo Kato Author-X-Name-First: Kengo Author-X-Name-Last: Kato Author-Name: David Ruppert Author-X-Name-First: David Author-X-Name-Last: Ruppert Title: Bootstrap Inference for Quantile-based Modal Regression Abstract: In this article, we develop uniform inference methods for the conditional mode based on quantile regression. Specifically, we propose to estimate the conditional mode by minimizing the derivative of the estimated conditional quantile function defined by smoothing the linear quantile regression estimator, and develop two bootstrap methods, a novel pivotal bootstrap and the nonparametric bootstrap, for our conditional mode estimator.
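A minimal sketch of the quantile-based conditional mode estimator just described, using linear quantile regression from statsmodels: since the derivative of the conditional quantile function equals the reciprocal of the conditional density, the mode is read off where that derivative is smallest. The grid, the crude numerical derivative, and the absence of the article's smoothing and bootstrap steps are all simplifications of this sketch.

    import numpy as np
    import statsmodels.api as sm

    def conditional_mode(y, X, x0, taus=np.linspace(0.1, 0.9, 33)):
        """Fit linear quantile regressions over a grid of levels, evaluate
        the conditional quantile curve tau -> Q(tau | x0), and return Q at
        the level where dQ/dtau (the estimated sparsity function, i.e. one
        over the conditional density) is minimized."""
        Xc = sm.add_constant(X)
        q = np.array([sm.QuantReg(y, Xc).fit(q=t).params @ np.r_[1.0, x0]
                      for t in taus])
        dq = np.gradient(q, taus)      # numerical derivative of Q in tau
        return q[dq.argmin()]          # density is highest where dQ is smallest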
Building on high-dimensional Gaussian approximation techniques, we establish the validity of simultaneous confidence rectangles constructed from the two bootstrap methods for the conditional mode. We also extend the preceding analysis to the case where the dimension of the covariate vector is increasing with the sample size. Finally, we conduct simulation experiments and a real data analysis using the U.S. wage data to demonstrate the finite sample performance of our inference method. The supplemental materials include the wage dataset, R codes and an appendix containing proofs of the main results, additional simulation results, discussion of model misspecification and quantile crossing, and additional details of the numerical implementation. Journal: Journal of the American Statistical Association Pages: 122-134 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1918130 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1918130 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:122-134 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1945459_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Lilun Du Author-X-Name-First: Lilun Author-X-Name-Last: Du Author-Name: Xu Guo Author-X-Name-First: Xu Author-X-Name-Last: Guo Author-Name: Wenguang Sun Author-X-Name-First: Wenguang Author-X-Name-Last: Sun Author-Name: Changliang Zou Author-X-Name-First: Changliang Author-X-Name-Last: Zou Title: False Discovery Rate Control Under General Dependence By Symmetrized Data Aggregation Abstract: We develop a new class of distribution-free multiple testing rules for false discovery rate (FDR) control under general dependence. A key element in our proposal is a symmetrized data aggregation (SDA) approach to incorporating the dependence structure via sample splitting, data screening, and information pooling. The proposed SDA filter first constructs a sequence of ranking statistics that fulfill global symmetry properties, and then chooses a data-driven threshold along the ranking to control the FDR. The SDA filter substantially outperforms the Knockoff method in power under moderate to strong dependence, and is more robust than existing methods based on asymptotic p-values. We first develop finite-sample theories to provide an upper bound for the actual FDR under general dependence, and then establish the asymptotic validity of SDA for both the FDR and false discovery proportion control under mild regularity conditions. The procedure is implemented in the R package sdafilter. Numerical results confirm the effectiveness and robustness of SDA in FDR control and show that it achieves substantial power gain over existing methods in many settings. Journal: Journal of the American Statistical Association Pages: 607-621 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1945459 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1945459 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
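The thresholding step of the SDA filter described above can be sketched compactly in Python: given ranking statistics that are symmetric about zero under the null, the left tail estimates the number of false discoveries in the right tail, exactly as in knockoff-type procedures. The construction of the statistics themselves (sample splitting, screening, and information pooling) is omitted, and the function name is hypothetical.

    import numpy as np

    def symmetry_threshold(W, alpha=0.1):
        """Pick the smallest threshold t whose estimated false discovery
        proportion (1 + #{W_j <= -t}) / max(#{W_j >= t}, 1) is below alpha,
        and reject the hypotheses with W_j >= t."""
        for t in np.sort(np.abs(W[W != 0])):
            fdp = (1 + (W <= -t).sum()) / max((W >= t).sum(), 1)
            if fdp <= alpha:
                return t, np.nonzero(W >= t)[0]
        return np.inf, np.array([], dtype=int)  # nothing can be rejected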
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:607-621 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1929246_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Yunan Wu Author-X-Name-First: Yunan Author-X-Name-Last: Wu Author-Name: Lan Wang Author-X-Name-First: Lan Author-X-Name-Last: Wang Author-Name: Haoda Fu Author-X-Name-First: Haoda Author-X-Name-Last: Fu Title: Model-Assisted Uniformly Honest Inference for Optimal Treatment Regimes in High Dimension Abstract: This article develops new tools to quantify uncertainty in optimal decision making and to gain insight into which variables one should collect information about given the potential cost of measuring a large number of variables. We investigate simultaneous inference to determine if a group of variables is relevant for estimating an optimal decision rule in a high-dimensional semiparametric framework. The unknown link function permits flexible modeling of the interactions between the treatment and the covariates, but leads to nonconvex estimation in high dimension and imposes significant challenges for inference. We first establish that a local restricted strong convexity condition holds with high probability and that any feasible local sparse solution of the estimation problem can achieve the near-oracle estimation error bound. We further rigorously verify that a wild bootstrap procedure based on a debiased version of the local solution can provide asymptotically honest uniform inference for the effect of a group of variables on optimal decision making. The advantage of honest inference is that it does not require the initial estimator to achieve perfect model selection and does not require the zero and nonzero effects to be well-separated. We also propose an efficient algorithm for estimation. Our simulations suggest satisfactory performance. An example from a diabetes study illustrates the real application. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 305-314 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1929246 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1929246 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:305-314 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1950734_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Erin E. Gabriel Author-X-Name-First: Erin E. Author-X-Name-Last: Gabriel Author-Name: Arvid Sjölander Author-X-Name-First: Arvid Author-X-Name-Last: Sjölander Author-Name: Michael C. Sachs Author-X-Name-First: Michael C. Author-X-Name-Last: Sachs Title: Nonparametric Bounds for Causal Effects in Imperfect Randomized Experiments Abstract: Nonignorable missingness and noncompliance can occur even in well-designed randomized experiments, making the intervention effect that the experiment was designed to estimate nonidentifiable. Nonparametric causal bounds provide a way to narrow the range of possible values for a nonidentifiable causal effect with minimal assumptions. We derive novel bounds for the causal risk difference for a binary outcome and intervention in randomized experiments with nonignorable missingness that is caused by a variety of mechanisms, with both perfect and imperfect compliance. 
We show that the so-called worst-case imputation, whereby all missing subjects on the intervention arm are assumed to have events and all missing subjects on the control or placebo arm are assumed to be event-free, can be too pessimistic in blinded studies with perfect compliance, and does not bound the correct estimand under imperfect compliance. We illustrate the use of the proposed bounds in our motivating data example of the effect of peanut consumption on the development of peanut allergies in infants. We find that, even accounting for potentially nonignorable missingness and noncompliance, our derived bounds confirm that regular exposure to peanuts reduces the risk of development of peanut allergies, making the results of this study much more compelling. Journal: Journal of the American Statistical Association Pages: 684-692 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1950734 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1950734 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:684-692 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1933498_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Wanrong Zhu Author-X-Name-First: Wanrong Author-X-Name-Last: Zhu Author-Name: Xi Chen Author-X-Name-First: Xi Author-X-Name-Last: Chen Author-Name: Wei Biao Wu Author-X-Name-First: Wei Biao Author-X-Name-Last: Wu Title: Online Covariance Matrix Estimation in Stochastic Gradient Descent Abstract: The stochastic gradient descent (SGD) algorithm is widely used for parameter estimation, especially for huge datasets and online learning. While this recursive algorithm is popular for its computational and memory efficiency, quantifying the variability and randomness of the solutions has rarely been studied. This article aims to conduct statistical inference on SGD-based estimates in an online setting. In particular, we propose a fully online estimator for the covariance matrix of averaged SGD (ASGD) iterates, using only the iterates from SGD. We formally establish our online estimator’s consistency and show that the convergence rate is comparable to that of offline counterparts. Based on the classic asymptotic normality results of ASGD, we construct asymptotically valid confidence intervals for model parameters. Upon receiving new observations, we can quickly update the covariance matrix estimate and the confidence intervals. This approach fits in an online setting and takes full advantage of SGD’s efficiency in computation and memory. Journal: Journal of the American Statistical Association Pages: 393-404 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1933498 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1933498 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:393-404 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2123332_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Eleni Matechou Author-X-Name-First: Eleni Author-X-Name-Last: Matechou Author-Name: Raffaele Argiento Author-X-Name-First: Raffaele Author-X-Name-Last: Argiento Title: Capture-Recapture Models with Heterogeneous Temporary Emigration Abstract: We propose a novel approach for modeling capture-recapture (CR) data on open populations that exhibit temporary emigration, while also accounting for individual heterogeneity to allow for differences in visit patterns and capture probabilities between individuals. Our modeling approach combines changepoint processes—fitted using an adaptive approach—for inferring individual visits, with Bayesian mixture modeling—fitted using a nonparametric approach—for identifying clusters of individuals with similar visit patterns or capture probabilities. The proposed method is extremely flexible as it can be applied to any CR dataset and is not reliant upon specialized sampling schemes, such as Pollock’s robust design. We fit the new model to motivating data on salmon anglers collected annually at the Gaula river in Norway. Our results when analyzing data from the 2017, 2018, and 2019 seasons reveal two clusters of anglers—consistent across years—with substantially different visit patterns. Most anglers are allocated to the “occasional visitors” cluster, making infrequent and shorter visits with mean total length of stay at the river of around seven days, whereas there also exists a small cluster of “super visitors,” with regular and longer visits, with mean total length of stay of around 30 days in a season. Our estimate of the probability of catching salmon whilst at the river is more than three times higher than that obtained when using a model that does not account for temporary emigration, giving us a better understanding of the impact of fishing at the river. Finally, we discuss the effect of the COVID-19 pandemic on the angling population by modeling data from the 2020 season. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 56-69 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2022.2123332 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2123332 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:56-69 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1933499_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Francesco Denti Author-X-Name-First: Francesco Author-X-Name-Last: Denti Author-Name: Federico Camerlenghi Author-X-Name-First: Federico Author-X-Name-Last: Camerlenghi Author-Name: Michele Guindani Author-X-Name-First: Michele Author-X-Name-Last: Guindani Author-Name: Antonietta Mira Author-X-Name-First: Antonietta Author-X-Name-Last: Mira Title: A Common Atoms Model for the Bayesian Nonparametric Analysis of Nested Data Abstract: The use of large datasets for targeted therapeutic interventions requires new ways to characterize the heterogeneity observed across subgroups of a specific population. 
In particular, models for partially exchangeable data are needed for inference on nested datasets, where the observations are assumed to be organized in different units and some sharing of information is required to learn distinctive features of the units. In this manuscript, we propose a nested common atoms model (CAM) that is particularly suited for the analysis of nested datasets where the distributions of the units are expected to differ only over a small fraction of the observations sampled from each unit. The proposed CAM allows a two-layered clustering at the distributional and observational levels and is amenable to scalable posterior inference through the use of a computationally efficient nested slice sampler algorithm. We further discuss how to extend the proposed modeling framework to handle discrete measurements, and we conduct posterior inference on a real microbiome dataset from a diet swap study to investigate how the alterations in intestinal microbiota composition are associated with different eating habits. We further investigate the performance of our model in capturing true distributional structures in the population by means of a simulation study. Journal: Journal of the American Statistical Association Pages: 405-416 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1933499 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1933499 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:405-416 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1938082_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Jie Zhou Author-X-Name-First: Jie Author-X-Name-Last: Zhou Author-Name: Will Wei Sun Author-X-Name-First: Will Wei Author-X-Name-Last: Sun Author-Name: Jingfei Zhang Author-X-Name-First: Jingfei Author-X-Name-Last: Zhang Author-Name: Lexin Li Author-X-Name-First: Lexin Author-X-Name-Last: Li Title: Partially Observed Dynamic Tensor Response Regression Abstract: In modern data science, dynamic tensor data prevail in numerous applications. An important task is to characterize the relationship between dynamic tensor datasets and external covariates. However, the tensor data are often only partially observed, rendering many existing methods inapplicable. In this article, we develop a regression model with a partially observed dynamic tensor as the response and external covariates as predictors. We introduce the low-rankness, sparsity, and fusion structures on the regression coefficient tensor, and consider a loss function projected over the observed entries. We develop an efficient nonconvex alternating updating algorithm, and derive the finite-sample error bound of the actual estimator from each step of our optimization algorithm. Unobserved entries in the tensor response impose serious challenges. As a result, our proposal differs considerably from existing tensor completion or tensor response regression solutions in terms of the estimation algorithm, regularity conditions, and theoretical properties. We illustrate the efficacy of our proposed method using simulations and two real applications, including a neuroimaging dementia study and a digital advertising study.
Journal: Journal of the American Statistical Association Pages: 424-439 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1938082 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1938082 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:424-439 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1929248_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Shuhao Jiao Author-X-Name-First: Shuhao Author-X-Name-Last: Jiao Author-Name: Alexander Aue Author-X-Name-First: Alexander Author-X-Name-Last: Aue Author-Name: Hernando Ombao Author-X-Name-First: Hernando Author-X-Name-Last: Ombao Title: Functional Time Series Prediction Under Partial Observation of the Future Curve Abstract: This article tackles one of the most fundamental goals in functional time series analysis, which is to provide reliable predictions for future functions. Existing methods for predicting a complete future functional observation use only completely observed trajectories. We develop a new method, called partial functional prediction (PFP), which uses both completely observed trajectories and partial information (available partial data) on the trajectory to be predicted. The PFP method includes an automatic selection criterion for tuning parameters based on minimizing the prediction error, and the convergence rate of the PFP prediction is established. Simulation studies demonstrate that incorporating the partially observed trajectory in the prediction outperforms existing methods with respect to mean squared prediction error. The PFP method is shown to be superior in the analysis of environmental data and traffic flow data. Journal: Journal of the American Statistical Association Pages: 315-326 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1929248 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1929248 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:315-326 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1933497_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Zheng Tracy Ke Author-X-Name-First: Zheng Tracy Author-X-Name-Last: Ke Author-Name: Yucong Ma Author-X-Name-First: Yucong Author-X-Name-Last: Ma Author-Name: Xihong Lin Author-X-Name-First: Xihong Author-X-Name-Last: Lin Title: Estimation of the Number of Spiked Eigenvalues in a Covariance Matrix by Bulk Eigenvalue Matching Analysis Abstract: The spiked covariance model has gained increasing popularity in high-dimensional data analysis. A fundamental problem is the determination of the number of spiked eigenvalues, K. For estimation of K, most attention has focused on the use of the top eigenvalues of the sample covariance matrix, and there is little investigation into proper ways of using bulk eigenvalues to estimate K. We propose a principled approach to incorporating bulk eigenvalues in the estimation of K. Our method imposes a working model on the residual covariance matrix, which is assumed to be a diagonal matrix whose entries are drawn from a gamma distribution. Under this model, the bulk eigenvalues are asymptotically close to the quantiles of a fixed parametric distribution.
This motivates us to propose a two-step method: the first step uses bulk eigenvalues to estimate parameters of this distribution, and the second step leverages these parameters to assist the estimation of K. The resulting estimator K̂ aggregates information in a large number of bulk eigenvalues. We show the consistency of K̂ under a standard spiked covariance model. We also propose a confidence interval estimate for K. Our extensive simulation studies show that the proposed method is robust and outperforms the existing methods in a range of scenarios. We apply the proposed method to the analysis of a lung cancer microarray dataset and the 1000 Genomes dataset. Journal: Journal of the American Statistical Association Pages: 374-392 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1933497 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1933497 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:374-392 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1918554_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Wenxuan Zhong Author-X-Name-First: Wenxuan Author-X-Name-Last: Zhong Author-Name: Yiwen Liu Author-X-Name-First: Yiwen Author-X-Name-Last: Liu Author-Name: Peng Zeng Author-X-Name-First: Peng Author-X-Name-Last: Zeng Title: A Model-free Variable Screening Method Based on Leverage Score Abstract: With rapid advances in information technology, massive datasets are collected in all fields of science, such as biology, chemistry, and social science. Useful or meaningful information is often extracted from these data through statistical learning or model fitting. In massive datasets, both the sample size and the number of predictors can be large, in which case conventional methods face computational challenges. Recently, an innovative and effective sampling scheme based on leverage scores via singular value decompositions has been proposed to select rows of a design matrix as a surrogate of the full data in linear regression. Analogously, variable screening can be viewed as selecting rows of the design matrix. However, effective variable selection along this line of thinking remains elusive. In this article, we bridge this gap by proposing a weighted leverage variable screening method that uses both the left and right singular vectors of the design matrix. We show theoretically and empirically that the predictors selected using our method can consistently include true predictors not only for linear models but also for complicated general index models. Extensive simulation studies show that the weighted leverage screening method is highly computationally efficient and effective. We also demonstrate its success in identifying carcinoma-related genes using spatial transcriptome data. Journal: Journal of the American Statistical Association Pages: 135-146 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1918554 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1918554 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:135-146 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1933496_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Hua Liu Author-X-Name-First: Hua Author-X-Name-Last: Liu Author-Name: Jinhong You Author-X-Name-First: Jinhong Author-X-Name-Last: You Author-Name: Jiguo Cao Author-X-Name-First: Jiguo Author-X-Name-Last: Cao Title: A Dynamic Interaction Semiparametric Function-on-Scalar Model Abstract: Motivated by recent work studying massive functional data, such as the COVID-19 data, we propose a new dynamic interaction semiparametric function-on-scalar (DISeF) model. The proposed model is useful for exploring the dynamic interaction among a set of covariates and their effects on the functional response. The proposed model includes many important models investigated recently as special cases. By approximating the unknown bivariate coefficient functions with tensor-product B-splines, an efficient three-step estimation procedure is developed to iteratively estimate the bivariate varying-coefficient functions, the vector of index parameters, and the covariance functions of random effects. We also establish the asymptotic properties of the estimators, including the convergence rate and their asymptotic distributions. In addition, we develop a test statistic to check whether the dynamic interaction varies with time/spatial locations, and we prove the asymptotic normality of the test statistic. The finite sample performance of our proposed method and of the test statistic is investigated with several simulation studies. Our proposed DISeF model is also used to analyze the COVID-19 data and the ADNI data. In both applications, hypothesis testing shows that the bivariate varying-coefficient functions significantly vary with the index and the time/spatial locations. For instance, we find that the interaction effect of population aging and socio-economic covariates, such as the numbers of hospital beds, physicians, and nurses per 1000 people and GDP per capita, on the COVID-19 mortality rate varies across different periods of the COVID-19 pandemic. A healthcare infrastructure index related to the COVID-19 mortality rate, estimated with the proposed DISeF model, is also obtained for 141 countries. Journal: Journal of the American Statistical Association Pages: 360-373 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1933496 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1933496 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:360-373 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2161385_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Jörg Drechsler Author-X-Name-First: Jörg Author-X-Name-Last: Drechsler Title: Differential Privacy for Government Agencies—Are We There Yet? Abstract: Government agencies typically need to take potential risks of disclosure into account whenever they publish statistics based on their data or give external researchers access to collected data. In this context, the promise of formal privacy guarantees offered by concepts such as differential privacy seems to be the panacea enabling the agencies to exactly quantify and control the privacy loss incurred by any data release.
Nevertheless, despite the excitement in academia and industry, most agencies—with the prominent exception of the U.S. Census Bureau—have been reluctant to even consider the concept for their data release strategy. This article discusses potential reasons for this. We argue that the requirements for implementing differential privacy approaches at government agencies are often fundamentally different from the requirements in industry. This raises many challenges and questions that still need to be addressed before the concept can be used as an overarching principle when sharing data with the public. The article does not offer any solutions to these challenges. Instead, we hope to stimulate some collaborative research efforts, as we believe that many of the problems can only be addressed by interdisciplinary collaborations. Journal: Journal of the American Statistical Association Pages: 761-773 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2022.2161385 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2161385 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:761-773 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1944874_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Yisu Jia Author-X-Name-First: Yisu Author-X-Name-Last: Jia Author-Name: Stefanos Kechagias Author-X-Name-First: Stefanos Author-X-Name-Last: Kechagias Author-Name: James Livsey Author-X-Name-First: James Author-X-Name-Last: Livsey Author-Name: Robert Lund Author-X-Name-First: Robert Author-X-Name-Last: Lund Author-Name: Vladas Pipiras Author-X-Name-First: Vladas Author-X-Name-Last: Pipiras Title: Latent Gaussian Count Time Series Abstract: This article develops the theory and methods for modeling a stationary count time series via Gaussian transformations. The techniques use a latent Gaussian process and a distributional transformation to construct stationary series with very flexible correlation features that can have any prespecified marginal distribution, including the classical Poisson, generalized Poisson, negative binomial, and binomial structures. Gaussian pseudo-likelihood and implied Yule–Walker estimation paradigms, based on the autocovariance function of the count series, are developed via a new Hermite expansion. Particle filtering and sequential Monte Carlo methods are used to conduct likelihood estimation. Connections to state space models are made. Our estimation approaches are evaluated in a simulation study and the methods are used to analyze a count series of weekly retail sales. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 596-606 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1944874 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1944874 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:596-606 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1950003_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Arkajyoti Saha Author-X-Name-First: Arkajyoti Author-X-Name-Last: Saha Author-Name: Sumanta Basu Author-X-Name-First: Sumanta Author-X-Name-Last: Basu Author-Name: Abhirup Datta Author-X-Name-First: Abhirup Author-X-Name-Last: Datta Title: Random Forests for Spatially Dependent Data Abstract: Spatial linear mixed models, consisting of a linear covariate effect and a Gaussian process (GP) distributed spatial random effect, are widely used for analyses of geospatial data. We consider the setting where the covariate effect is nonlinear. Random forests (RF) are popular for estimating nonlinear functions, but applications of RF to spatial data have often ignored the spatial correlation. We show that this adversely impacts the performance of RF. We propose RF-GLS, a novel and well-principled extension of RF, for estimating nonlinear covariate effects in spatial mixed models where the spatial correlation is modeled using GP. RF-GLS extends RF in the same way generalized least squares (GLS) fundamentally extends ordinary least squares (OLS) to accommodate dependence in linear models. RF becomes a special case of RF-GLS, and is substantially outperformed by RF-GLS for both estimation and prediction across extensive numerical experiments with spatially correlated data. RF-GLS can be used for functional estimation in other types of dependent data, such as time series. We prove consistency of RF-GLS for β-mixing dependent error processes that include the popular spatial Matérn GP. As a byproduct, we also establish, to our knowledge, the first consistency result for RF under dependence. We establish results of independent interest, including a general consistency result for GLS optimizers of data-driven function classes, and a uniform law of large numbers under β-mixing dependence with weaker assumptions. These new tools can potentially be useful for asymptotic analysis of other GLS-style estimators in nonparametric regression with dependent data. Journal: Journal of the American Statistical Association Pages: 665-683 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1950003 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1950003 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:665-683 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1928514_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Yi Liu Author-X-Name-First: Yi Author-X-Name-Last: Liu Author-Name: Veronika Ročková Author-X-Name-First: Veronika Author-X-Name-Last: Ročková Title: Variable Selection Via Thompson Sampling Abstract: Thompson sampling is a heuristic algorithm for the multi-armed bandit problem with a long tradition in machine learning. The algorithm has a Bayesian spirit in the sense that it selects arms based on posterior samples of the reward probabilities of each arm. By forging a connection between combinatorial binary bandits and spike-and-slab variable selection, we propose a stochastic optimization approach to subset selection called Thompson variable selection (TVS). TVS is a framework for interpretable machine learning that does not require the underlying model to be linear.
TVS brings together Bayesian reinforcement learning and machine learning in order to extend the reach of Bayesian subset selection to nonparametric models and large datasets with very many predictors and/or very many observations. Depending on the choice of a reward, TVS can be deployed in offline as well as online setups with streaming data batches. Tailoring multiplay bandits to variable selection, we provide regret bounds without necessarily assuming that the arm mean rewards are unrelated. We demonstrate very strong empirical performance on both simulated and real data. Unlike deterministic optimization methods for spike-and-slab variable selection, TVS is stochastic in nature, which makes it less prone to local convergence and thereby more robust. Journal: Journal of the American Statistical Association Pages: 287-304 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1928514 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1928514 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:287-304 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2123333_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Yeonjoo Park Author-X-Name-First: Yeonjoo Author-X-Name-Last: Park Author-Name: Bo Li Author-X-Name-First: Bo Author-X-Name-Last: Li Author-Name: Yehua Li Author-X-Name-First: Yehua Author-X-Name-Last: Li Title: Crop Yield Prediction Using Bayesian Spatially Varying Coefficient Models with Functional Predictors Abstract: Reliable prediction of crop yield is crucial for economic planning, food security monitoring, and agricultural risk management. This study aims to develop a crop yield forecasting model at large spatial scales using meteorological variables closely related to crop growth. The influence of climate patterns on agricultural productivity can be spatially inhomogeneous due to local soil and environmental conditions. We propose a Bayesian spatially varying functional model (BSVFM) to predict county-level corn yield for five Midwestern states, based on annual precipitation and daily maximum and minimum temperature trajectories modeled as multivariate functional predictors. The proposed model accommodates spatial correlation and measurement errors of functional predictors, and respects the spatially heterogeneous relationship between the response and associated predictors by allowing the functional coefficients to vary over space. The model also incorporates a Bayesian variable selection device to further expand its capacity to accommodate spatial heterogeneity. The proposed method is demonstrated to outperform other highly competitive methods in corn yield prediction, owing to its flexibility in allowing spatial heterogeneity through spatially varying coefficients. Our study provides further insights into understanding the impact of climate change on crop yield. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 70-83 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2022.2123333 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2123333 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:70-83 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1942013_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Ines Wilms Author-X-Name-First: Ines Author-X-Name-Last: Wilms Author-Name: Sumanta Basu Author-X-Name-First: Sumanta Author-X-Name-Last: Basu Author-Name: Jacob Bien Author-X-Name-First: Jacob Author-X-Name-Last: Bien Author-Name: David S. Matteson Author-X-Name-First: David S. Author-X-Name-Last: Matteson Title: Sparse Identification and Estimation of Large-Scale Vector AutoRegressive Moving Averages Abstract: The vector autoregressive moving average (VARMA) model is fundamental to the theory of multivariate time series; however, identifiability issues have led practitioners to abandon it in favor of the simpler but more restrictive vector autoregressive (VAR) model. We narrow this gap with a new optimization-based approach to VARMA identification built upon the principle of parsimony. Among all equivalent data-generating models, we use convex optimization to seek the parameterization that is simplest in a certain sense. A user-specified strongly convex penalty is used to measure model simplicity, and that same penalty is then used to define an estimator that can be efficiently computed. We establish consistency of our estimators in a double-asymptotic regime. Our nonasymptotic error bound analysis accommodates both model specification and parameter estimation steps, a feature that is crucial for studying large-scale VARMA algorithms. Our analysis also provides new results on penalized estimation of infinite-order VAR, and elastic net regression under a singular covariance structure of regressors, which may be of independent interest. We illustrate the advantage of our method over VAR alternatives on three real data examples. Journal: Journal of the American Statistical Association Pages: 571-582 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1942013 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1942013 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:571-582 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1938583_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Charles E. McCulloch Author-X-Name-First: Charles E. Author-X-Name-Last: McCulloch Author-Name: John M. Neuhaus Author-X-Name-First: John M. Author-X-Name-Last: Neuhaus Title: Improving Predictions When Interest Focuses on Extreme Random Effects Abstract: Statistical models that generate predicted random effects are widely used to evaluate the performance of and rank patients, physicians, hospitals and health plans from longitudinal and clustered data. Predicted random effects have been proven to outperform treating clusters as fixed effects (essentially a categorical predictor variable) and using standard regression models, on average. These predicted random effects are often used to identify extreme or outlying values, such as poorly performing hospitals or patients with rapid declines in their health. When interest focuses on the extremes rather than performance on average, there has been no systematic investigation of best approaches. 
We develop novel methods for prediction of extreme values, evaluate their performance, and illustrate their application using data from the Osteoarthritis Initiative to predict walking speed in older adults. The new methods substantially outperform standard random and fixed-effects approaches for extreme values. Journal: Journal of the American Statistical Association Pages: 504-513 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1938583 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1938583 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:504-513 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1952877_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Manuel Arellano Author-X-Name-First: Manuel Author-X-Name-Last: Arellano Author-Name: Stéphane Bonhomme Author-X-Name-First: Stéphane Author-X-Name-Last: Bonhomme Title: Recovering Latent Variables by Matching Abstract: We propose an optimal-transport-based matching method to nonparametrically estimate linear models with independent latent variables. The method consists in generating pseudo-observations from the latent variables, so that the Euclidean distance between the model’s predictions and their matched counterparts in the data is minimized. We show that our nonparametric estimator is consistent, and we document that it performs well in simulated data. We apply this method to study the cyclicality of permanent and transitory income shocks in the Panel Study of Income Dynamics. We find that the dispersion of income shocks is approximately acyclical, whereas the skewness of permanent shocks is procyclical. By comparison, we find that the dispersion and skewness of shocks to hourly wages vary little with the business cycle. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 693-706 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1952877 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1952877 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:693-706 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2126779_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Matthew Simpson Author-X-Name-First: Matthew Author-X-Name-Last: Simpson Author-Name: Scott H. Holan Author-X-Name-First: Scott H. Author-X-Name-Last: Holan Author-Name: Christopher K. Wikle Author-X-Name-First: Christopher K. Author-X-Name-Last: Wikle Author-Name: Jonathan R. Bradley Author-X-Name-First: Jonathan R. Author-X-Name-Last: Bradley Title: Interpolating Population Distributions using Public-Use Data: An Application to Income Segregation using American Community Survey Data Abstract: Income inequality is an important problem for demographers, policy makers, economists, and social scientists. A causal link has been hypothesized between income inequality and income segregation, which measures how much households with similar incomes cluster. The information theory index is used to measure income segregation; however, critics have suggested the divergence index instead. Motivated by this, we construct both indices using American Community Survey (ACS) estimates of features of the income distribution.
Since the elimination of the decennial census long form, methods of computing these indices must be updated to interpolate ACS estimates and account for survey error. We propose a novel model-based method to do this, which improves on previous approaches by using more types of estimates and by providing uncertainty quantification. We apply this method to estimate U.S. census tract-level income distributions, and in turn use these to construct both income segregation indices. We find major differences between the two indices, as well as evidence that the information index underestimates the relationship between income inequality and income segregation. The literature suggests interventions designed to reduce income inequality by reducing income segregation, or vice versa, so using the information index implicitly understates the value of these interventions. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 84-96 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2022.2126779 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2126779 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:84-96 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2139264_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Zhengwu Zhang Author-X-Name-First: Zhengwu Author-X-Name-Last: Zhang Author-Name: Yuexuan Wu Author-X-Name-First: Yuexuan Author-X-Name-Last: Wu Author-Name: Di Xiong Author-X-Name-First: Di Author-X-Name-Last: Xiong Author-Name: Joseph G. Ibrahim Author-X-Name-First: Joseph G. Author-X-Name-Last: Ibrahim Author-Name: Anuj Srivastava Author-X-Name-First: Anuj Author-X-Name-Last: Srivastava Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Title: Rejoinder: LESA: Longitudinal Elastic Shape Analysis of Brain Subcortical Structures Journal: Journal of the American Statistical Association Pages: 25-28 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2022.2139264 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2139264 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:25-28 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2128806_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Tianhai Zu Author-X-Name-First: Tianhai Author-X-Name-Last: Zu Author-Name: Heng Lian Author-X-Name-First: Heng Author-X-Name-Last: Lian Author-Name: Brittany Green Author-X-Name-First: Brittany Author-X-Name-Last: Green Author-Name: Yan Yu Author-X-Name-First: Yan Author-X-Name-Last: Yu Title: Ultra-High Dimensional Quantile Regression for Longitudinal Data: An Application to Blood Pressure Analysis Abstract: Despite major advances in research and treatment, identifying important genotype risk factors for high blood pressure remains challenging. Traditional genome-wide association studies (GWAS) focus on one single nucleotide polymorphism (SNP) at a time. We aim to select among over half a million SNPs along with time-varying phenotype variables via simultaneous modeling and variable selection, focusing on the most dangerous blood pressure levels at high quantiles.
Taking advantage of rich data from a large-scale public health study, we develop and apply a novel quantile penalized generalized estimating equations (GEE) approach, incorporating several key aspects including ultra-high dimensional genetic SNPs, the longitudinal nature of blood pressure measurements, time-varying covariates, and conditional high quantiles of blood pressure. Importantly, we identify interesting new SNPs for high blood pressure. Moreover, we find that blood pressure levels are likely heterogeneous, with the important risk factors identified differing among quantiles. This comprehensive picture of conditional quantiles of blood pressure can provide more insight and enable targeted treatments. We provide an efficient computational algorithm and prove consistency, asymptotic normality, and the oracle property for the quantile penalized GEE estimators with ultra-high dimensional predictors. Moreover, we establish model-selection consistency for high-dimensional BIC. Simulation studies show the promise of the proposed approach. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 97-108 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2022.2128806 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2128806 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:97-108 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1948419_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Fangzheng Xie Author-X-Name-First: Fangzheng Author-X-Name-Last: Xie Author-Name: Yanxun Xu Author-X-Name-First: Yanxun Author-X-Name-Last: Xu Title: Efficient Estimation for Random Dot Product Graphs via a One-Step Procedure Abstract: We propose a one-step procedure to estimate the latent positions in random dot product graphs efficiently. Unlike the classical spectral-based methods, the proposed one-step procedure takes advantage of both the low-rank structure of the expected adjacency matrix and the Bernoulli likelihood information of the sampling model simultaneously. We show that for each vertex, the corresponding row of the one-step estimator (OSE) converges to a multivariate normal distribution after proper scaling and centering up to an orthogonal transformation, with an efficient covariance matrix. The initial estimator for the one-step procedure needs to satisfy the so-called approximate linearization property. The OSE improves on the commonly adopted spectral embedding methods in the following sense: Globally for all vertices, it yields an asymptotic sum of squares error no greater than that of the spectral methods, and locally for each vertex, the asymptotic covariance matrix of the corresponding row of the OSE dominates those of the spectral embeddings in spectra. The usefulness of the proposed one-step procedure is demonstrated via numerical examples and the analysis of a real-world Wikipedia graph dataset. Journal: Journal of the American Statistical Association Pages: 651-664 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1948419 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1948419 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:651-664 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2105703_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Jiayin Zheng Author-X-Name-First: Jiayin Author-X-Name-Last: Zheng Author-Name: Xinyuan Dong Author-X-Name-First: Xinyuan Author-X-Name-Last: Dong Author-Name: Christina C. Newton Author-X-Name-First: Christina C. Author-X-Name-Last: Newton Author-Name: Li Hsu Author-X-Name-First: Li Author-X-Name-Last: Hsu Title: A Generalized Integration Approach to Association Analysis with Multi-category Outcome: An Application to a Tumor Sequencing Study of Colorectal Cancer and Smoking Abstract: Cancer is a heterogeneous disease, and rapid progress in sequencing and -omics technologies has enabled researchers to characterize tumors comprehensively. This has stimulated intense interest in studying how risk factors are associated with various heterogeneous tumor features. The Cancer Prevention Study-II (CPS-II) cohort is one of the largest prospective studies, particularly valuable for elucidating associations between cancer and risk factors. In this article, we investigate the association of smoking with novel colorectal tumor markers obtained from targeted sequencing. However, due to cost and logistical difficulties, only a limited number of tumors can be assayed, which limits our capability to study these associations. Meanwhile, there are extensive studies assessing the association of smoking with overall cancer risk and established colorectal tumor markers. Importantly, such summary information is readily available from the literature. By linking this summary information to parameters of interest with proper constraints, we develop a generalized integration approach for the polytomous logistic regression model with the outcome characterized by tumor features. The proposed approach gains efficiency by maximizing the joint likelihood of individual-level tumor data and external summary information under constraints that narrow the parameter search space. We apply the proposed method to the CPS-II data and identify associations of smoking with colorectal cancer risk that differ by the mutational status of the APC and RNF43 genes; neither association is identified by the conventional analysis of CPS-II individual data only. These results help to better understand the role of smoking in the etiology of colorectal cancer. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 29-42 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2022.2105703 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2105703 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:29-42 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2174869_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Kwun Chuen Gary Chan Author-X-Name-First: Kwun Chuen Gary Author-X-Name-Last: Chan Title: Handbook of Measurement Error Models Journal: Journal of the American Statistical Association Pages: 776-777 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2023.2174869 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2174869 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:776-777 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1942014_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Katarzyna Reluga Author-X-Name-First: Katarzyna Author-X-Name-Last: Reluga Author-Name: María-José Lombardía Author-X-Name-First: María-José Author-X-Name-Last: Lombardía Author-Name: Stefan Sperlich Author-X-Name-First: Stefan Author-X-Name-Last: Sperlich Title: Simultaneous Inference for Empirical Best Predictors With a Poverty Study in Small Areas Abstract: Today, generalized linear mixed models (GLMM) are broadly used in many fields. However, the development of tools for performing simultaneous inference has been largely neglected in this domain. A framework for joint inference is indispensable to carry out statistically valid multiple comparisons of parameters of interest between all or several clusters. We therefore develop simultaneous confidence intervals and multiple testing procedures for empirical best predictors under GLMM. In addition, we implement our methodology to study widely employed examples of mixed models, that is, the unit-level binomial, the area-level Poisson-gamma and the area-level Poisson-lognormal mixed models. The asymptotic results are accompanied by extensive simulations. A case study on predicting poverty rates illustrates applicability and advantages of our simultaneous inference tools. Journal: Journal of the American Statistical Association Pages: 583-595 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1942014 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1942014 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:583-595 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2123334_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Daiwei Zhang Author-X-Name-First: Daiwei Author-X-Name-Last: Zhang Author-Name: Jian Kang Author-X-Name-First: Jian Author-X-Name-Last: Kang Title: Discussion of “LESA: Longitudinal Elastic Shape Analysis of Brain Subcortical Structures” Journal: Journal of the American Statistical Association Pages: 22-24 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2022.2123334 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2123334 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:22-24 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1938084_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Ye Tian Author-X-Name-First: Ye Author-X-Name-Last: Tian Author-Name: Yang Feng Author-X-Name-First: Yang Author-X-Name-Last: Feng Title: RaSE: A Variable Screening Framework via Random Subspace Ensembles Abstract: Variable screening methods have been shown to be effective in dimension reduction under the ultra-high dimensional setting. Most existing screening methods are designed to rank the predictors according to their individual contributions to the response. As a result, variables that are marginally independent but jointly dependent with the response could be missed. 
In this work, we propose a new framework for variable screening, random subspace ensemble (RaSE), which works by evaluating the quality of random subspaces that may cover multiple predictors. This new screening framework can be naturally combined with any subspace evaluation criterion, which leads to an array of screening methods. The framework is capable of identifying signals with no marginal effect or with high-order interaction effects. It is shown to enjoy the sure screening property and rank consistency. We also develop an iterative version of RaSE screening with theoretical support. Extensive simulation studies and real-data analysis show the effectiveness of the new screening framework. Journal: Journal of the American Statistical Association Pages: 457-468 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1938084 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1938084 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:457-468 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1947307_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Guangyu Yang Author-X-Name-First: Guangyu Author-X-Name-Last: Yang Author-Name: Baqun Zhang Author-X-Name-First: Baqun Author-X-Name-Last: Zhang Author-Name: Min Zhang Author-X-Name-First: Min Author-X-Name-Last: Zhang Title: Estimation of Knots in Linear Spline Models Abstract: The linear spline model is able to accommodate nonlinear effects while allowing for an easy interpretation. It has significant applications in studying threshold effects and change-points. However, its application in practice has been limited by the lack of a method for estimating knots that is both rigorously studied and computationally convenient. A key difficulty in estimating knots lies in the nondifferentiability. In this article, we study influence functions of regular and asymptotically linear estimators for linear spline models using semiparametric theory. Based on the theoretical development, we propose a simple semismooth estimating equation approach to circumvent the nondifferentiability issue using modified derivatives, in contrast to the previous smoothing-based methods. Without relying on any smoothing parameters, the proposed method is computationally convenient. To further improve numerical stability, a two-step algorithm taking advantage of the analytic solution available when knots are known is developed to solve the proposed estimating equation. Consistency and asymptotic normality are rigorously derived using empirical process theory. Simulation studies show that the two-step algorithm performs well in terms of both statistical and computational properties and improves over existing methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 639-650 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1947307 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1947307 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:639-650 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1941054_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: María F. Gil–Leyva Author-X-Name-First: María F.
Author-X-Name-Last: Gil–Leyva Author-Name: Ramsés H. Mena Author-X-Name-First: Ramsés H. Author-X-Name-Last: Mena Title: Stick-Breaking Processes With Exchangeable Length Variables Abstract: Our object of study is the general class of stick-breaking processes with exchangeable length variables. These generalize well-known Bayesian nonparametric priors in an unexplored direction. We give conditions to ensure that the respective species sampling process is proper and the corresponding prior has full support. For a rich subclass we explain how, by tuning a single [0,1]-valued parameter, the stochastic ordering of the weights can be modulated, and Dirichlet and Geometric priors can be recovered. A general formula for the distribution of the latent allocation variables is derived and an MCMC algorithm is proposed for density estimation purposes. Journal: Journal of the American Statistical Association Pages: 537-550 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1941054 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1941054 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:537-550 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1924178_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Kuang-Yao Lee Author-X-Name-First: Kuang-Yao Author-X-Name-Last: Lee Author-Name: Dingjue Ji Author-X-Name-First: Dingjue Author-X-Name-Last: Ji Author-Name: Lexin Li Author-X-Name-First: Lexin Author-X-Name-Last: Li Author-Name: Todd Constable Author-X-Name-First: Todd Author-X-Name-Last: Constable Author-Name: Hongyu Zhao Author-X-Name-First: Hongyu Author-X-Name-Last: Zhao Title: Conditional Functional Graphical Models Abstract: Graphical modeling of multivariate functional data is becoming increasingly important in a wide variety of applications. Changes in graph structure can often be attributed to external variables, such as the diagnosis status or time, the latter of which gives rise to the problem of dynamic graphical modeling. Most existing methods focus on estimating the graph by aggregating samples, but largely ignore the subject-level heterogeneity due to the external variables. In this article, we introduce a conditional graphical model for multivariate random functions, where we treat the external variables as the conditioning set, and allow the graph structure to vary with the external variables. Our method is built on two new linear operators, the conditional precision operator and the conditional partial correlation operator, which extend the precision matrix and the partial correlation matrix to both the conditional and functional settings. We show that their nonzero elements can be used to characterize the conditional graphs, and develop the corresponding estimators. We establish the uniform convergence of the proposed estimators and the consistency of the estimated graph, while allowing the graph size to grow with the sample size, and accommodating both completely and partially observed data. We demonstrate the efficacy of the method through both simulations and a study of a brain functional connectivity network.
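As a rough illustration of the construction underlying the Gil–Leyva and Mena record above: any stick-breaking prior builds its weights from [0,1]-valued length variables, and the Dirichlet and Geometric priors mentioned there arise from particular laws for those lengths. The sketch below (plain NumPy; all names are ours, not the authors') is a minimal version of the textbook construction, not the authors' exchangeable-length machinery.
```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking_weights(lengths):
    """Map stick-breaking length variables v_j in [0,1] to weights
    w_j = v_j * prod_{l<j} (1 - v_l)."""
    lengths = np.asarray(lengths)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - lengths[:-1])))
    return lengths * remaining

# Dirichlet-process special case: iid Beta(1, alpha) length variables.
alpha = 2.0
w_dir = stick_breaking_weights(rng.beta(1.0, alpha, size=50))

# Geometric special case: constant lengths v_j = p give w_j = p (1-p)^(j-1),
# a degenerate (hence exchangeable) length law.
w_geo = stick_breaking_weights(np.full(50, 0.3))
print(w_dir.sum(), w_geo.sum())  # both close to 1 after 50 sticks
```
Tuning the length law trades off how fast the weights decay, which is the stochastic-ordering knob the abstract refers to.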
Journal: Journal of the American Statistical Association Pages: 257-271 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1924178 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1924178 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:257-271 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1953506_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Edward McFowland Author-X-Name-First: Edward Author-X-Name-Last: McFowland Author-Name: Cosma Rohilla Shalizi Author-X-Name-First: Cosma Rohilla Author-X-Name-Last: Shalizi Title: Estimating Causal Peer Influence in Homophilous Social Networks by Inferring Latent Locations Abstract: Social influence cannot be identified from purely observational data on social networks, because such influence is generically confounded with latent homophily, that is, with a node’s network partners being informative about the node’s attributes and therefore its behavior. If the network grows according to either a latent community (stochastic block) model, or a continuous latent space model, then latent homophilous attributes can be consistently estimated from the global pattern of social ties. We show that, for common versions of those two network models, these estimates are so informative that controlling for estimated attributes allows for asymptotically unbiased and consistent estimation of social-influence effects in linear models. In particular, the bias shrinks at a rate that directly reflects how much information the network provides about the latent attributes. These are the first results on the consistent nonexperimental estimation of social-influence effects in the presence of latent homophily, and we discuss the prospects for generalizing them. Journal: Journal of the American Statistical Association Pages: 707-718 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1953506 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1953506 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:707-718 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1923509_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Yingying Dong Author-X-Name-First: Yingying Author-X-Name-Last: Dong Author-Name: Ying-Ying Lee Author-X-Name-First: Ying-Ying Author-X-Name-Last: Lee Author-Name: Michael Gou Author-X-Name-First: Michael Author-X-Name-Last: Gou Title: Regression Discontinuity Designs With a Continuous Treatment Abstract: The standard regression discontinuity (RD) design deals with a binary treatment. Many empirical applications of RD designs involve continuous treatments. This article establishes identification and robust bias-corrected inference for such RD designs. Causal identification is achieved by using any changes in the distribution of the continuous treatment at the RD threshold (including the usual mean change as a special case). We discuss a double-robust identification approach and propose an estimand that incorporates the standard fuzzy RD estimand as a special case. Applying the proposed approach, we estimate the impacts of bank capital on bank failure in the pre-Great Depression era in the United States. 
Our RD design takes advantage of the minimum capital requirements, which change discontinuously with town size. Journal: Journal of the American Statistical Association Pages: 208-221 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1923509 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1923509 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:208-221 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2110876_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Xiaoyu Song Author-X-Name-First: Xiaoyu Author-X-Name-Last: Song Author-Name: Jiayi Ji Author-X-Name-First: Jiayi Author-X-Name-Last: Ji Author-Name: Pei Wang Author-X-Name-First: Pei Author-X-Name-Last: Wang Title: iProMix: A Mixture Model for Studying the Function of ACE2 based on Bulk Proteogenomic Data Abstract: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused over six million deaths in the ongoing COVID-19 pandemic. SARS-CoV-2 uses ACE2 protein to enter human cells, raising a pressing need to characterize proteins/pathways that interact with ACE2. Large-scale proteomic profiling technology is not yet mature enough at single-cell resolution to examine protein activities in disease-relevant cell types. We propose iProMix, a novel statistical framework to identify epithelial-cell specific associations between ACE2 and other proteins/pathways with bulk proteomic data. iProMix decomposes the data and models the cell-type-specific conditional joint distribution of proteins through a mixture model. It improves cell-type composition estimation from prior input, and uses a nonparametric inference framework to account for uncertainty of cell-type proportion estimates in hypothesis testing. Simulations demonstrate that iProMix has well-controlled false discovery rates and favorable power in nonasymptotic settings. We apply iProMix to the proteomic data of 110 (tumor-adjacent) normal lung tissue samples from the Clinical Proteomic Tumor Analysis Consortium lung adenocarcinoma study, and identify interferon α/γ response pathways as the most significant pathways associated with ACE2 protein abundances in epithelial cells. Strikingly, the association direction is sex-specific. This result sheds light on sex differences in COVID-19 incidence and outcomes, and motivates sex-specific evaluation of interferon therapies. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 43-55 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2022.2110876 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2110876 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
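To make the decomposition idea in the iProMix record above concrete, here is a toy simulation (ours, with hypothetical effect sizes; not the authors' estimator): bulk measurements are generated as proportion-weighted mixtures of an epithelial and a non-epithelial component, and a naive bulk-on-bulk correlation attenuates an association that exists only in epithelial cells, which is what a mixture model of this kind aims to recover.
```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
pi = rng.beta(4, 2, n)                         # epithelial proportion per sample
x_epi = rng.normal(0, 1, n)                    # ACE2-like signal in epithelium
y_epi = 0.8 * x_epi + rng.normal(0, 0.5, n)    # association only in epithelium
x_oth = rng.normal(0, 1, n)
y_oth = rng.normal(0, 1, n)                    # no association in other cells

# Observed bulk abundances are proportion-weighted mixtures of the two.
x_bulk = pi * x_epi + (1 - pi) * x_oth
y_bulk = pi * y_epi + (1 - pi) * y_oth

# The bulk correlation is weaker than the epithelial-specific one.
print(np.corrcoef(x_bulk, y_bulk)[0, 1], np.corrcoef(x_epi, y_epi)[0, 1])
```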
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:43-55 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1941052_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Lingzhu Li Author-X-Name-First: Lingzhu Author-X-Name-Last: Li Author-Name: Xuehu Zhu Author-X-Name-First: Xuehu Author-X-Name-Last: Zhu Author-Name: Lixing Zhu Author-X-Name-First: Lixing Author-X-Name-Last: Zhu Title: Adaptive-to-Model Hybrid of Tests for Regressions Abstract: In model checking for regressions, nonparametric estimation-based tests usually have tractable limiting null distributions and are sensitive to oscillating alternative models, but suffer from the curse of dimensionality. In contrast, empirical process-based tests can, at the fastest possible rate, detect local alternatives distinct from the null model, yet are less sensitive to oscillating alternatives and rely on Monte Carlo approximation for critical value determination, which is costly in computation. We propose an adaptive-to-model hybrid of moment and conditional moment-based tests to fully inherit the merits of these two types of tests and avoid the shortcomings. Further, such a hybrid makes nonparametric estimation-based tests, under the alternatives, also share the merits of existing empirical process-based tests. The methodology can be readily applied to other kinds of data and to the construction of other hybrids. As a by-product in the sufficient dimension reduction field, a study on the residual-related central mean subspace and central subspace for model adaptation is devoted to showing when alternative models can be indicated and when they cannot. Numerical studies are conducted to verify the power of the proposed test. Journal: Journal of the American Statistical Association Pages: 514-523 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1941052 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1941052 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:514-523 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1923510_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Xin Xing Author-X-Name-First: Xin Author-X-Name-Last: Xing Author-Name: Zhigen Zhao Author-X-Name-First: Zhigen Author-X-Name-Last: Zhao Author-Name: Jun S. Liu Author-X-Name-First: Jun S. Author-X-Name-Last: Liu Title: Controlling False Discovery Rate Using Gaussian Mirrors Abstract: Simultaneously finding multiple influential variables and controlling the false discovery rate (FDR) for linear regression models is a fundamental problem. We here propose the Gaussian Mirror (GM) method, which creates for each predictor variable a pair of mirror variables by adding and subtracting a randomly generated Gaussian perturbation, and proceeds with a certain regression method, such as ordinary least squares or the Lasso (the mirror variables can also be created after selection). The mirror variables naturally lead to test statistics effective for controlling the FDR. Under a mild assumption on the dependence among the covariates, we show that the FDR can be controlled at any designated level asymptotically.
We also demonstrate through extensive numerical studies that the GM method is more powerful than many existing methods for selecting relevant variables subject to FDR control, especially in cases where the covariates are highly correlated and the influential variables are not overly sparse. Journal: Journal of the American Statistical Association Pages: 222-241 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1923510 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1923510 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:222-241 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1923511_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Kosuke Imai Author-X-Name-First: Kosuke Author-X-Name-Last: Imai Author-Name: Michael Lingzhi Li Author-X-Name-First: Michael Lingzhi Author-X-Name-Last: Li Title: Experimental Evaluation of Individualized Treatment Rules Abstract: The increasing availability of individual-level data has led to numerous applications of individualized (or personalized) treatment rules (ITRs). Policy makers often wish to empirically evaluate ITRs and compare their relative performance before implementing them in a target population. We propose a new evaluation metric, the population average prescriptive effect (PAPE). The PAPE compares the performance of an ITR with that of a non-individualized treatment rule, which randomly treats the same proportion of units. Averaging the PAPE over a range of budget constraints yields our second evaluation metric, the area under the prescriptive effect curve (AUPEC). The AUPEC represents an overall performance measure for evaluation, like the area under the receiver operating characteristic curve (AUROC) does for classification, and is a generalization of the QINI coefficient used in uplift modeling. We use Neyman’s repeated sampling framework to estimate the PAPE and AUPEC and derive their exact finite-sample variances based on random sampling of units and random assignment of treatment. We extend our methodology to a common setting, in which the same experimental data are used to both estimate and evaluate ITRs. In this case, our variance calculation incorporates the additional uncertainty due to random splits of data used for cross-validation. The proposed evaluation metrics can be estimated without requiring modeling assumptions, asymptotic approximation, or resampling methods. As a result, the approach is applicable to any ITR, including those based on complex machine learning algorithms. An open-source software package is available for implementing the proposed methodology. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 242-256 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1923511 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1923511 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
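A minimal plug-in sketch of the PAPE idea from the Imai and Li record above, assuming a Bernoulli randomized experiment: the value of an ITR is contrasted with that of a random rule treating the same proportion of units. The code below uses simple IPW and difference-in-means components, not the authors' Neyman-style estimator with its exact finite-sample variance; all names and effect sizes are our own.
```python
import numpy as np

def value_ipw(y, t, rule, p_treat):
    """IPW estimate of E[Y(rule)] in a Bernoulli(p_treat) experiment."""
    match = (t == rule)
    w = np.where(t == 1, p_treat, 1.0 - p_treat)
    return np.mean(match * y / w)

def pape_hat(y, t, rule, p_treat):
    """Contrast the ITR with a random rule treating the same proportion."""
    p_f = rule.mean()
    mu1, mu0 = np.mean(y[t == 1]), np.mean(y[t == 0])
    random_value = p_f * mu1 + (1.0 - p_f) * mu0
    return value_ipw(y, t, rule, p_treat) - random_value

rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=n)
t = rng.binomial(1, 0.5, n)
y = x * t + rng.normal(size=n)       # treatment helps only when x > 0
rule = (x > 0).astype(int)           # an ITR that exploits this
print(pape_hat(y, t, rule, 0.5))     # positive: ITR beats random treating
```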
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:242-256 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1938581_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Kun Zhou Author-X-Name-First: Kun Author-X-Name-Last: Zhou Author-Name: Ker-Chau Li Author-X-Name-First: Ker-Chau Author-X-Name-Last: Li Author-Name: Qing Zhou Author-X-Name-First: Qing Author-X-Name-Last: Zhou Title: Honest Confidence Sets for High-Dimensional Regression by Projection and Shrinkage Abstract: The issue of honesty in constructing confidence sets arises in nonparametric regression. While the optimal rate in nonparametric estimation can be achieved and utilized to construct sharp confidence sets, severe degradation of the confidence level often occurs after estimating the degree of smoothness. Similarly, for high-dimensional regression, oracle inequalities for sparse estimators could be utilized to construct sharp confidence sets. Yet, the degree of sparsity itself is unknown and needs to be estimated, which causes the honesty problem. To resolve this issue, we develop a novel method to construct honest confidence sets for sparse high-dimensional linear regression. The key idea in our construction is to separate signals into a strong and a weak group, and then construct confidence sets for each group separately. This is achieved by a projection and shrinkage approach, the latter implemented via Stein estimation and the associated Stein unbiased risk estimate. Our confidence set is honest over the full parameter space without any sparsity constraints, while its size adapts to the optimal rate of n^{-1/4} when the true parameter is indeed sparse. Moreover, under some form of separation assumption between the strong and weak signals, the diameter of our confidence set can achieve a faster rate than existing methods. Through extensive numerical comparisons on both simulated and real data, we demonstrate that our method outperforms its competitors by large margins in finite samples, including oracle methods built upon the true sparsity of the underlying model. Journal: Journal of the American Statistical Association Pages: 469-488 Issue: 541 Volume: 118 Year: 2023 Month: 1 X-DOI: 10.1080/01621459.2021.1938581 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1938581 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:469-488 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1962328_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Francesca R. Crucinio Author-X-Name-First: Francesca R. Author-X-Name-Last: Crucinio Author-Name: Arnaud Doucet Author-X-Name-First: Arnaud Author-X-Name-Last: Doucet Author-Name: Adam M. Johansen Author-X-Name-First: Adam M. Author-X-Name-Last: Johansen Title: A Particle Method for Solving Fredholm Equations of the First Kind Abstract: Fredholm integral equations of the first kind are the prototypical example of ill-posed linear inverse problems. They model, among other things, reconstruction of distorted noisy observations and indirect density estimation and also appear in instrumental variable regression. However, their numerical solution remains a challenging problem. Many techniques currently available require a preliminary discretization of the domain of the solution and make strong assumptions about its regularity.
For example, the popular expectation maximization smoothing (EMS) scheme requires the assumption of piecewise constant solutions, which is inappropriate for most applications. We propose here a novel particle method that circumvents these two issues. This algorithm can be thought of as a Monte Carlo approximation of the EMS scheme which not only performs an adaptive stochastic discretization of the domain but also results in smooth approximate solutions. We analyze the theoretical properties of the EMS iteration and of the corresponding particle algorithm. Compared to standard EMS, we show experimentally that our novel particle method provides state-of-the-art performance for realistic systems, including motion deblurring and reconstruction of cross-section images of the brain from positron emission tomography. Journal: Journal of the American Statistical Association Pages: 937-947 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1962328 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1962328 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:937-947 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2152342_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Wei Zhong Author-X-Name-First: Wei Author-X-Name-Last: Zhong Author-Name: Chen Qian Author-X-Name-First: Chen Author-X-Name-Last: Qian Author-Name: Wanjun Liu Author-X-Name-First: Wanjun Author-X-Name-Last: Liu Author-Name: Liping Zhu Author-X-Name-First: Liping Author-X-Name-Last: Zhu Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Title: Feature Screening for Interval-Valued Response with Application to Study Association between Posted Salary and Required Skills Abstract: It is important to quantify the differences in returns to skills using online job advertisement data, which have attracted great interest in both the labor economics and statistics fields. In this article, we study the relationship between the posted salary and the job requirements in online labor markets. There are two challenges to deal with. First, the posted salary is always presented in an interval-valued form, for example, 5k–10k yuan per month. Simply taking the mid-point or the lower bound as a proxy for the salary may result in biased estimators. Second, the number of potential skill words generated as predictors from the job advertisements by word segmentation is very large and many of them may not contribute to the salary. To this end, we propose a new feature screening method, Absolute Distribution Difference Sure Independence Screening (ADD-SIS), to select important skill words for the interval-valued response. The marginal utility for feature screening is based on the difference of estimated distribution functions via nonparametric maximum likelihood estimation, which sufficiently uses the interval information. It is model-free and robust to outliers. Numerical simulations show that the new method using the interval information is more efficient in selecting important predictors than methods based only on single points of the intervals. In the real data application, we study the text data of job advertisements for data scientists and data analysts on a major Chinese online job posting website, and explore the important skill words for the salary.
We find that skill words like optimization, long short-term memory (LSTM), convolutional neural networks (CNN), and collaborative filtering are positively correlated with the salary, while words like Excel, Office, and data collection may contribute negatively to the salary. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 805-817 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2022.2152342 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2152342 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:805-817 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1996376_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Xin-Bing Kong Author-X-Name-First: Xin-Bing Author-X-Name-Last: Kong Author-Name: Jin-Guan Lin Author-X-Name-First: Jin-Guan Author-X-Name-Last: Lin Author-Name: Cheng Liu Author-X-Name-First: Cheng Author-X-Name-Last: Liu Author-Name: Guang-Ying Liu Author-X-Name-First: Guang-Ying Author-X-Name-Last: Liu Title: Discrepancy Between Global and Local Principal Component Analysis on Large-Panel High-Frequency Data Abstract: In this article, we study the discrepancy between the global principal component analysis (GPCA) and local principal component analysis (LPCA) in recovering the common components of large-panel high-frequency data. We measure the discrepancy by the total sum of squared differences between common components reconstructed from GPCA and LPCA. The asymptotic distribution of the discrepancy measure is provided when the factor space is time invariant as the dimension p and sample size n tend to infinity simultaneously. Alternatively, when the factor space changes, the discrepancy measure explodes under some mild signal condition on the magnitude of time-variation of the factor space. We apply the theory to test the invariance in time of the factor space. The test performs well in controlling the Type I error and detecting time-varying factor spaces, as verified by extensive simulation studies. A real data analysis provides strong evidence that the factor space is always time-varying within a time span longer than one week. Journal: Journal of the American Statistical Association Pages: 1333-1344 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1996376 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1996376 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1333-1344 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1970570_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Le Zhou Author-X-Name-First: Le Author-X-Name-Last: Zhou Author-Name: Hui Zou Author-X-Name-First: Hui Author-X-Name-Last: Zou Title: Cross-Fitted Residual Regression for High-Dimensional Heteroscedasticity Pursuit Abstract: There is a vast amount of work on high-dimensional regression. The common starting point for the existing theoretical work is to assume the data generating model is a homoscedastic linear regression model with some sparsity structure. In reality, the homoscedasticity assumption is often violated, and hence understanding the heteroscedasticity of the data is of critical importance.
In this article, we systematically study the estimation of a high-dimensional heteroscedastic regression model. In particular, the emphasis is on how to detect and estimate the heteroscedasticity effects reliably and efficiently. To this end, we propose a cross-fitted residual regression approach and prove that the resulting estimator is selection consistent for heteroscedasticity effects and establish its rates of convergence. Our estimator has tuning parameters to be determined by the data in practice. We propose a novel high-dimensional BIC for tuning parameter selection and establish its consistency. This is the first high-dimensional BIC result under heteroscedasticity. The theoretical analysis is more involved in order to handle heteroscedasticity, and we develop a couple of interesting new concentration inequalities that are of independent interest. Journal: Journal of the American Statistical Association Pages: 1056-1065 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1970570 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1970570 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1056-1065 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2133718_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Jin-Hong Du Author-X-Name-First: Jin-Hong Author-X-Name-Last: Du Author-Name: Yifeng Guo Author-X-Name-First: Yifeng Author-X-Name-Last: Guo Author-Name: Xueqin Wang Author-X-Name-First: Xueqin Author-X-Name-Last: Wang Title: High-Dimensional Portfolio Selection with Cardinality Constraints Abstract: The expanding number of assets offers more opportunities for investors but poses new challenges for modern portfolio management (PM). As a central plank of PM, portfolio selection by expected utility maximization (EUM) faces uncontrollable estimation and optimization errors in ultrahigh-dimensional scenarios. Past strategies for high-dimensional PM mainly concern only large-cap companies and select many stocks, making PM impractical. We propose a sample-average-approximation-based portfolio strategy to tackle the difficulties above with cardinality constraints. Our strategy bypasses the estimation of mean and covariance, the Chinese walls in high-dimensional scenarios. Empirical results on S&P 500 and Russell 2000 show that an appropriate number of carefully chosen assets leads to better out-of-sample mean-variance efficiency. On Russell 2000, our best portfolio profits as much as the equally weighted portfolio but reduces the maximum drawdown and the average number of assets by 10% and 90%, respectively. The flexibility and stability of incorporating factor signals to augment out-of-sample performance are also demonstrated. Our strategy balances the tradeoff among the return, the risk, and the number of assets with cardinality constraints. Therefore, we provide a theoretically sound and computationally efficient strategy to make PM practical in the growing global financial market. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 779-791 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2022.2133718 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2133718 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
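For intuition about the cardinality constraint in the portfolio record above, here is a deliberately naive forward-selection sketch that caps the number of assets at k. It is not the authors' sample-average-approximation strategy; it only illustrates what choosing "an appropriate number of carefully chosen assets" means operationally, on synthetic returns.
```python
import numpy as np

def greedy_cardinality_portfolio(returns, k):
    """Toy forward selection of k assets maximizing the in-sample Sharpe
    ratio of the equally weighted sub-portfolio."""
    n_assets = returns.shape[1]
    chosen = []
    for _ in range(k):
        best, best_sharpe = None, -np.inf
        for j in range(n_assets):
            if j in chosen:
                continue
            port = returns[:, chosen + [j]].mean(axis=1)
            sharpe = port.mean() / port.std()
            if sharpe > best_sharpe:
                best, best_sharpe = j, sharpe
        chosen.append(best)
    return chosen

rng = np.random.default_rng(3)
R = rng.normal(0.001, 0.02, size=(500, 50))   # 500 days, 50 assets
print(greedy_cardinality_portfolio(R, 5))
```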
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:779-791 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1987920_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Ying Hung Author-X-Name-First: Ying Author-X-Name-Last: Hung Author-Name: Li-Hsiang Lin Author-X-Name-First: Li-Hsiang Author-X-Name-Last: Lin Author-Name: C. F. Jeff Wu Author-X-Name-First: C. F. Jeff Author-X-Name-Last: Wu Title: Optimal Simulator Selection Abstract: Computer simulators are widely used for the study of complex systems. In many applications, there are multiple simulators available with different scientific interpretations of the underlying mechanism, and the goal is to identify an optimal simulator based on the observed physical experiments. To achieve the goal, we propose a selection criterion based on leave-one-out cross-validation. This criterion consists of a goodness-of-fit measure and a generalized degrees of freedom term penalizing the simulator sensitivity to perturbations in the physical observations. Asymptotic properties of the selected optimal simulator are discussed. It is shown that the proposed procedure includes a conventional calibration method as a special case. The finite sample performance of the proposed procedure is demonstrated through numerical examples. In an application to cell biology, an optimal simulator is selected, which can shed light on the T cell recognition mechanism in the human immune system. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1264-1271 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1987920 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1987920 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1264-1271 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1996378_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Jiangzhou Wang Author-X-Name-First: Jiangzhou Author-X-Name-Last: Wang Author-Name: Jingfei Zhang Author-X-Name-First: Jingfei Author-X-Name-Last: Zhang Author-Name: Binghui Liu Author-X-Name-First: Binghui Author-X-Name-Last: Liu Author-Name: Ji Zhu Author-X-Name-First: Ji Author-X-Name-Last: Zhu Author-Name: Jianhua Guo Author-X-Name-First: Jianhua Author-X-Name-Last: Guo Title: Fast Network Community Detection With Profile-Pseudo Likelihood Methods Abstract: The stochastic block model is one of the most studied network models for community detection, and fitting its likelihood function on large-scale networks is known to be challenging. One prominent work that overcomes this computational challenge is the fast pseudo-likelihood approach proposed by Amini et al. for fitting stochastic block models to large sparse networks. However, this approach does not have a convergence guarantee, and may not be well suited for small- and medium-scale networks. In this article, we propose a novel likelihood-based approach that decouples row and column labels in the likelihood function, enabling a fast alternating maximization. This new method is computationally efficient, performs well for both small- and large-scale networks, and has a provable convergence guarantee. We show that our method provides strongly consistent estimates of communities in a stochastic block model.
We further consider extensions of our proposed method to handle networks with degree heterogeneity and bipartite properties. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1359-1372 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1996378 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1996378 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1359-1372 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1990766_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Dongdong Li Author-X-Name-First: Dongdong Author-X-Name-Last: Li Author-Name: X. Joan Hu Author-X-Name-First: X. Joan Author-X-Name-Last: Hu Author-Name: Rui Wang Author-X-Name-First: Rui Author-X-Name-Last: Wang Title: Evaluating Association Between Two Event Times with Observations Subject to Informative Censoring Abstract: This article is concerned with evaluating the association between two event times without specifying the joint distribution parametrically. This is particularly challenging when the observations on the event times are subject to informative censoring due to a terminating event such as death. There are few methods suitable for assessing covariate effects on association in this context. We link the joint distribution of the two event times and the informative censoring time using a nested copula function. We use flexible functional forms to specify the covariate effects on both the marginal and joint distributions. In a semiparametric model for the bivariate event time, we estimate simultaneously the association parameters, the marginal survival functions, and the covariate effects. A byproduct of the approach is a consistent estimator for the induced marginal survival function of each event time conditional on the covariates. We develop an easy-to-implement pseudolikelihood-based inference procedure, derive the asymptotic properties of the estimators, and conduct simulation studies to examine the finite-sample performance of the proposed approach. For illustration, we apply our method to analyze data from the breast cancer survivorship study that motivated this research. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1282-1294 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1990766 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1990766 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1282-1294 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2151447_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Roulan Jiang Author-X-Name-First: Roulan Author-X-Name-Last: Jiang Author-Name: Xiang Zhan Author-X-Name-First: Xiang Author-X-Name-Last: Zhan Author-Name: Tianying Wang Author-X-Name-First: Tianying Author-X-Name-Last: Wang Title: A Flexible Zero-Inflated Poisson-Gamma Model with Application to Microbiome Sequence Count Data Abstract: In microbiome studies, it is of interest to use a sample from a population of microbes, such as the gut microbiota community, to estimate the population proportion of these taxa. 
However, due to biases introduced in sampling and preprocessing steps, these observed taxa abundances may not reflect true taxa abundance patterns in the ecosystem. Repeated measures, including longitudinal study designs, may be potential solutions to mitigate the discrepancy between observed abundances and true underlying abundances. Yet, widely observed zero-inflation and over-dispersion issues can distort downstream statistical analyses aiming to associate taxa abundances with covariates of interest. To this end, we propose a Zero-Inflated Poisson Gamma (ZIPG) model framework to address the aforementioned challenges. From a measurement-error perspective, we accommodate the discrepancy between observations and truths by decomposing the mean parameter in Poisson regression into a true abundance level and a multiplicative measurement of sampling variability from the microbial ecosystem. Then, we provide a flexible ZIPG model framework by connecting both the mean abundance and the variability of abundances to different covariates, and build valid statistical inference procedures for both parameter estimation and hypothesis testing. Through comprehensive simulation studies and real data applications, the proposed ZIPG method provides significant insights by distinguishing differential variability from differential mean abundance. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 792-804 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2022.2151447 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2151447 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:792-804 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1996379_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Stan Tendijck Author-X-Name-First: Stan Author-X-Name-Last: Tendijck Author-Name: Emma Eastoe Author-X-Name-First: Emma Author-X-Name-Last: Eastoe Author-Name: Jonathan Tawn Author-X-Name-First: Jonathan Author-X-Name-Last: Tawn Author-Name: David Randell Author-X-Name-First: David Author-X-Name-Last: Randell Author-Name: Philip Jonathan Author-X-Name-First: Philip Author-X-Name-Last: Jonathan Title: Modeling the Extremes of Bivariate Mixture Distributions With Application to Oceanographic Data Abstract: There currently exist a variety of statistical methods for modeling bivariate extremes. However, when the dependence between variables is driven by more than one latent process, these methods are likely to fail to give reliable inferences. We consider situations in which the observed dependence at extreme levels is a mixture of a possibly unknown number of much simpler bivariate distributions. For such structures, we demonstrate the limitations of existing methods and propose two new methods: an extension of the Heffernan–Tawn conditional extreme value model to allow for mixtures and an extremal quantile-regression approach. The two methods are examined in a simulation study and then applied to oceanographic data. Finally, we discuss extensions including a subasymptotic version of the proposed model, which has the potential to give more efficient results by incorporating data that are less extreme. Both new methods outperform existing approaches when mixtures are present.
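A generative sketch of the zero-inflated Poisson-Gamma outcome described in the ZIPG record above, with covariates entering both the mean and the dispersion (variability) components. The link functions, effect sizes, and the zero-inflation probability are our hypothetical choices, not the authors' fitted model.
```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
x = rng.binomial(1, 0.5, n)            # covariate of interest
log_mu = 2.0 + 0.7 * x                 # mean-abundance model
log_theta = -0.5 + 1.0 * x             # dispersion (variability) model
mu, theta = np.exp(log_mu), np.exp(log_theta)

# Poisson-Gamma: multiplicative Gamma(shape=1/theta, scale=theta) factor
# (mean 1, variance theta), then zero-inflation with probability p_zero.
u = rng.gamma(shape=1.0 / theta, scale=theta)
counts = rng.poisson(mu * u)
p_zero = 0.2
counts[rng.random(n) < p_zero] = 0

# The covariate shifts variability and mean separately, as ZIPG allows.
print(counts[x == 0].var(), counts[x == 1].var())
```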
Journal: Journal of the American Statistical Association Pages: 1373-1384 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1996379 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1996379 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1373-1384 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2000867_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Xinran Li Author-X-Name-First: Xinran Author-X-Name-Last: Li Author-Name: Bo Jiang Author-X-Name-First: Bo Author-X-Name-Last: Jiang Author-Name: Jun S. Liu Author-X-Name-First: Jun S. Author-X-Name-Last: Liu Title: Kernel-Based Partial Permutation Test for Detecting Heterogeneous Functional Relationship Abstract: We propose a kernel-based partial permutation test for checking the equality of the functional relationship between response and covariates among different groups. The main idea, which is intuitive and easy to implement, is to keep the projections of the response vector Y on leading principal components of a kernel matrix fixed and permute Y’s projections on the remaining principal components. The proposed test allows for different choices of kernels, corresponding to different classes of functions under the null hypothesis. First, using linear or polynomial kernels, our partial permutation tests are exactly valid in finite samples for linear or polynomial regression models with Gaussian noise; similar results straightforwardly extend to kernels with finite feature spaces. Second, by allowing the kernel feature space to diverge with the sample size, the test can be large-sample valid for a wider class of functions. Third, for general kernels with possibly infinite-dimensional feature space, the partial permutation test is exactly valid when the covariates are exactly balanced across all groups, or asymptotically valid when the underlying function follows certain regularized Gaussian processes. We further suggest test statistics using the likelihood ratio between two (nested) Gaussian process regression models, and propose computationally efficient algorithms utilizing the EM algorithm and Newton’s method, where the latter also involves Fisher scoring and quadratic programming and is particularly useful when EM suffers from slow convergence. Extensions to correlated and non-Gaussian noises have also been investigated theoretically or numerically. Furthermore, the test can be extended to use multiple kernels together and can thus enjoy properties from each kernel. Both a simulation study and an application illustrate the properties of the proposed test. Journal: Journal of the American Statistical Association Pages: 1429-1447 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.2000867 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2000867 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
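The resampling step of the kernel-based partial permutation test in the record above is easy to state in code: project the response onto the eigenbasis of the kernel matrix, hold the leading coordinates fixed, and permute the rest. The sketch below implements only that step, with a Gaussian kernel and an arbitrary cutoff m of our choosing; the test statistic and the group comparison are omitted.
```python
import numpy as np

def partial_permutation_sample(y, K, m, rng):
    """Keep y's projections on the m leading principal components of the
    kernel matrix K fixed; permute its coordinates on the remaining ones."""
    eigval, eigvec = np.linalg.eigh(K)           # ascending eigenvalues
    V = eigvec[:, np.argsort(eigval)[::-1]]      # descending order
    z = V.T @ y                                  # coordinates in eigenbasis
    z_perm = z.copy()
    z_perm[m:] = rng.permutation(z[m:])          # permute trailing coords
    return V @ z_perm

rng = np.random.default_rng(5)
n = 100
x = rng.normal(size=(n, 1))
K = np.exp(-(x - x.T) ** 2 / 2.0)                # Gaussian kernel matrix
y = x[:, 0] ** 2 + 0.1 * rng.normal(size=n)
y_null = partial_permutation_sample(y, K, m=10, rng=rng)
print(np.corrcoef(y, y_null)[0, 1])              # smooth part is retained
```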
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1429-1447 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1969238_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Yongyi Guo Author-X-Name-First: Yongyi Author-X-Name-Last: Guo Author-Name: Kaizheng Wang Author-X-Name-First: Kaizheng Author-X-Name-Last: Wang Title: Communication-Efficient Accurate Statistical Estimation Abstract: When the data are stored in a distributed manner, direct applications of traditional statistical inference procedures are often prohibitive due to communication costs and privacy concerns. This article develops and investigates two communication-efficient accurate statistical estimators (CEASE), implemented through iterative algorithms for distributed optimization. In each iteration, node machines carry out computation in parallel and communicate with the central processor, which then broadcasts aggregated information to node machines for new updates. The algorithms adapt to the similarity among loss functions on node machines, and converge rapidly when each node machine has a large enough sample size. Moreover, they do not require good initialization and enjoy linear convergence guarantees under general conditions. The contraction rate of optimization errors is presented explicitly, with dependence on the local sample size unveiled. In addition, the improved statistical accuracy per iteration is derived. By regarding the proposed method as a multistep statistical estimator, we show that statistical efficiency can be achieved in finitely many steps in typical statistical applications. In addition, we give the conditions under which the one-step CEASE estimator is statistically efficient. Extensive numerical experiments on both synthetic and real data validate the theoretical results and demonstrate the superior performance of our algorithms. Journal: Journal of the American Statistical Association Pages: 1000-1010 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1969238 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1969238 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
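The communication pattern in the CEASE record above can be sketched with a single aggregation primitive: node machines compute local gradients in parallel, the center averages them and broadcasts an update. The toy below uses a plain averaged-gradient step for logistic regression; the actual CEASE update minimizes a local surrogate loss and carries the contraction guarantees the abstract describes, neither of which is reproduced here.
```python
import numpy as np

def logistic_grad(beta, X, y):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ (p - y) / len(y)

def distributed_round(beta, shards, lr):
    """One communication round: each node sends its local gradient; the
    center averages them and broadcasts the updated iterate."""
    grads = [logistic_grad(beta, X, y) for X, y in shards]
    return beta - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(6)
beta_true = np.array([1.0, -2.0, 0.5])
shards = []
for _ in range(10):                       # 10 node machines
    X = rng.normal(size=(500, 3))
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))
    shards.append((X, y))

beta = np.zeros(3)
for _ in range(50):                       # compute, communicate, update
    beta = distributed_round(beta, shards, lr=2.0)
print(beta.round(2))                      # close to beta_true
```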
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1000-1010 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1996377_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Reza Mohammadi Author-X-Name-First: Reza Author-X-Name-Last: Mohammadi Author-Name: Hélène Massam Author-X-Name-First: Hélène Author-X-Name-Last: Massam Author-Name: Gérard Letac Author-X-Name-First: Gérard Author-X-Name-Last: Letac Title: Accelerating Bayesian Structure Learning in Sparse Gaussian Graphical Models Abstract: Bayesian structure learning in Gaussian graphical models is often done by search algorithms over the graph space. The conjugate prior for the precision matrix satisfying graphical constraints is the well-known G-Wishart. With this prior, the transition probabilities in the search algorithms necessitate evaluating the ratios of the prior normalizing constants of G-Wishart. In moderate to high dimensions, this ratio is often approximated by using sampling-based methods as computationally expensive updates in the search algorithm. Calculating this ratio has so far been a major computational bottleneck. We overcome this issue by presenting a search algorithm in which the ratio of normalizing constants is computed via an explicit closed-form approximation. Using this approximation within our search algorithm yields significant improvement in the scalability of structure learning without sacrificing structure learning accuracy. We study the conditions under which the approximation is valid. We also evaluate the efficacy of our method with simulation studies. We show that the new search algorithm with our approximation outperforms state-of-the-art methods in both computational efficiency and accuracy. The implementation of our work is available in the R package BDgraph. Journal: Journal of the American Statistical Association Pages: 1345-1358 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1996377 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1996377 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1345-1358 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1963262_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Efstathios Paparoditis Author-X-Name-First: Efstathios Author-X-Name-Last: Paparoditis Author-Name: Han Lin Shang Author-X-Name-First: Han Lin Author-X-Name-Last: Shang Title: Bootstrap Prediction Bands for Functional Time Series Abstract: A bootstrap procedure for constructing prediction bands for a stationary functional time series is proposed. The procedure exploits a general vector autoregressive representation of the time-reversed series of Fourier coefficients appearing in the Karhunen–Loève representation of the functional process. It generates backward-in-time functional replicates that adequately mimic the dependence structure of the underlying process in a model-free way and have the same conditionally fixed curves at the end of each functional pseudo-time series. The bootstrap prediction error distribution is then calculated as the difference between the model-free, bootstrap-generated future functional observations and the functional forecasts obtained from the model used for prediction.
This allows the estimated prediction error distribution to account for the innovation and estimation errors associated with prediction and the possible errors due to model misspecification. We establish the asymptotic validity of the bootstrap procedure in estimating the conditional prediction error distribution of interest, and we also show that the procedure enables the construction of prediction bands that achieve (asymptotically) the desired coverage. Prediction bands based on a consistent estimation of the conditional distribution of the studentized prediction error process are also introduced. Such bands take the local uncertainty of the prediction into account more appropriately. Through a simulation study and the analysis of two datasets, we demonstrate the capabilities and the good finite-sample performance of the proposed method. Journal: Journal of the American Statistical Association Pages: 972-986 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1963262 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1963262 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:972-986 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1987251_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Francis K. C. Hui Author-X-Name-First: Francis K. C. Author-X-Name-Last: Hui Author-Name: Samuel Müller Author-X-Name-First: Samuel Author-X-Name-Last: Müller Author-Name: A. H. Welsh Author-X-Name-First: A. H. Author-X-Name-Last: Welsh Title: GEE-Assisted Variable Selection for Latent Variable Models with Multivariate Binary Data Abstract: Multivariate data are commonly analyzed using one of two approaches: a conditional approach based on generalized linear latent variable models (GLLVMs) or some variation thereof, and a marginal approach based on generalized estimating equations (GEEs). With research on mixed models and GEEs having gone down separate paths, there is a common mindset to treat the two approaches as mutually exclusive, with the choice of which to use driven by the question of interest. In this article, focusing on multivariate binary responses, we study the connections between the parameters from conditional and marginal models, with the aim of using GEEs for fast variable selection in GLLVMs. This is accomplished through two main contributions. First, we show that GEEs are zero consistent for GLLVMs fitted to multivariate binary data. That is, if the true model is a GLLVM but we misspecify and fit GEEs, then the latter is able to asymptotically differentiate between truly zero versus nonzero coefficients in the former. Building on this result, we propose GEE-assisted variable selection for GLLVMs using score- and Wald-based information criteria to construct a fast forward selection path followed by pruning. We demonstrate that GEE-assisted variable selection is selection consistent for the underlying GLLVM, with simulation studies demonstrating its strong finite sample performance and computational efficiency. Journal: Journal of the American Statistical Association Pages: 1252-1263 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1987251 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1987251 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
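The zero-consistency result in the GEE record above is what licenses using a cheap marginal fit to screen coefficients for a GLLVM. Below is a minimal sketch with statsmodels' GEE on simulated clustered binary data, ranking coefficients by Wald z-statistics; the forward selection path and pruning of the actual proposal are omitted, and the data-generating model is our own.
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
n_clusters, m, p = 300, 4, 6
rows = []
for i in range(n_clusters):
    u = rng.normal()                              # latent cluster effect
    X = rng.normal(size=(m, p))
    eta = 1.0 * X[:, 0] - 1.0 * X[:, 1] + u       # only x0, x1 matter
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    for j in range(m):
        rows.append([i, y[j], *X[j]])
cols = ["cluster", "y", *[f"x{k}" for k in range(p)]]
df = pd.DataFrame(rows, columns=cols)

# Marginal GEE fit with an exchangeable working correlation; the Wald
# z-statistics give a fast ranking of candidate coefficients.
exog = sm.add_constant(df[[f"x{k}" for k in range(p)]])
res = sm.GEE(df["y"], exog, groups=df["cluster"],
             family=sm.families.Binomial(),
             cov_struct=sm.cov_struct.Exchangeable()).fit()
print(res.tvalues.abs().sort_values(ascending=False))
```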
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1252-1263 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1956501_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Yuxin Chen Author-X-Name-First: Yuxin Author-X-Name-Last: Chen Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Bingyan Wang Author-X-Name-First: Bingyan Author-X-Name-Last: Wang Author-Name: Yuling Yan Author-X-Name-First: Yuling Author-X-Name-Last: Yan Title: Convex and Nonconvex Optimization Are Both Minimax-Optimal for Noisy Blind Deconvolution Under Random Designs Abstract: We investigate the effectiveness of convex relaxation and nonconvex optimization in solving bilinear systems of equations under two different designs (i.e., a sort of random Fourier design and Gaussian design). Despite the wide applicability, the theoretical understanding of these two paradigms remains largely inadequate in the presence of random noise. This article makes two contributions by demonstrating that (i) a two-stage nonconvex algorithm attains minimax-optimal accuracy within a logarithmic number of iterations, and (ii) convex relaxation also achieves minimax-optimal statistical accuracy vis-à-vis random noise. Both results significantly improve upon the state-of-the-art theoretical guarantees. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 858-868 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1956501 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1956501 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:858-868 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1981338_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Bingkai Wang Author-X-Name-First: Bingkai Author-X-Name-Last: Wang Author-Name: Ryoko Susukida Author-X-Name-First: Ryoko Author-X-Name-Last: Susukida Author-Name: Ramin Mojtabai Author-X-Name-First: Ramin Author-X-Name-Last: Mojtabai Author-Name: Masoumeh Amin-Esmaeili Author-X-Name-First: Masoumeh Author-X-Name-Last: Amin-Esmaeili Author-Name: Michael Rosenblum Author-X-Name-First: Michael Author-X-Name-Last: Rosenblum Title: Model-Robust Inference for Clinical Trials that Improve Precision by Stratified Randomization and Covariate Adjustment Abstract: Two commonly used methods for improving precision and power in clinical trials are stratified randomization and covariate adjustment. However, many trials do not fully capitalize on the combined precision gains from these two methods, which can lead to wasted resources in terms of sample size and trial duration. We derive consistency and asymptotic normality of model-robust estimators that combine these two methods, and show that these estimators can lead to substantial gains in precision and power. Our theorems cover a class of estimators that handle continuous, binary, and time-to-event outcomes; missing outcomes under the missing at random assumption are handled as well. For each estimator, we give a formula for a consistent variance estimator that is model-robust and that fully captures variance reductions from stratified randomization and covariate adjustment.
Also, we give the first proof (to the best of our knowledge) of consistency and asymptotic normality of the Kaplan–Meier estimator under stratified randomization, and we derive its asymptotic variance. The above results also hold for the biased-coin covariate-adaptive design. We demonstrate our results using data from three trials of substance use disorder treatments, where the variance reduction due to stratified randomization and covariate adjustment ranges from 1% to 36%. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1152-1163 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1981338 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1981338 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1152-1163 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1999820_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Hui Chen Author-X-Name-First: Hui Author-X-Name-Last: Chen Author-Name: Haojie Ren Author-X-Name-First: Haojie Author-X-Name-Last: Ren Author-Name: Fang Yao Author-X-Name-First: Fang Author-X-Name-Last: Yao Author-Name: Changliang Zou Author-X-Name-First: Changliang Author-X-Name-Last: Zou Title: Data-driven selection of the number of change-points via error rate control Abstract: In multiple change-point analysis, one of the main difficulties is to determine the number of change-points. Various consistent selection methods, including the use of the Schwarz information criterion and cross-validation, have been proposed to balance the model fitting and complexity. However, there is a lack of systematic approaches that provide a theoretical guarantee of significance in determining the number of changes. In this paper, we introduce a data-adaptive selection procedure via error rate control based on order-preserving sample-splitting, which is applicable to most existing change-point methods. The key idea is to construct a series of statistics with a global symmetry property and then utilize the symmetry to derive a data-driven threshold. Under this general framework, we are able to rigorously investigate the false discovery proportion control, and show that the proposed method controls the false discovery rate (FDR) asymptotically under mild conditions while retaining the true change-points. Numerical experiments indicate that our selection procedure works well for many change-detection methods and is able to yield accurate FDR control in finite samples. Keywords: Empirical distribution; False discovery rate; Multiple change-point model; Sample-splitting; Symmetry; Uniform convergence. Journal: Journal of the American Statistical Association Pages: 1415-1428 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1999820 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1999820 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
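The thresholding logic behind the error-rate-controlled selection in the change-point record above can be sketched generically: build statistics that are roughly symmetric about zero for spurious candidates and large positive for true ones, then pick the smallest threshold whose symmetry-based false discovery proportion estimate falls below the target level. This mirror-style sketch is ours; the authors' construction of the statistics (via order-preserving sample-splitting) is not shown.
```python
import numpy as np

def symmetry_threshold(W, q=0.1):
    """Smallest t with estimated FDP = (#{W <= -t} + 1) / #{W >= t} <= q."""
    for t in np.sort(np.abs(W)):
        fdp_hat = (np.sum(W <= -t) + 1) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return t
    return np.inf

rng = np.random.default_rng(8)
W = np.concatenate([rng.normal(0, 1, 90),      # spurious candidates
                    rng.normal(6, 1, 10)])     # true change-points
t = symmetry_threshold(W, q=0.1)
print(t, np.sum(W >= t))                       # about 10 discoveries
```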
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1415-1428 Template-Type: ReDIF-Article 1.0 Author-Name: Zexi Song Author-X-Name-First: Zexi Author-X-Name-Last: Song Author-Name: Zhiqiang Tan Author-X-Name-First: Zhiqiang Author-X-Name-Last: Tan Title: Hamiltonian-Assisted Metropolis Sampling Abstract: Various Markov chain Monte Carlo (MCMC) methods are studied to improve upon random walk Metropolis sampling for simulation from complex distributions. Examples include Metropolis-adjusted Langevin algorithms, Hamiltonian Monte Carlo, and other algorithms related to underdamped Langevin dynamics. We propose a broad class of irreversible sampling algorithms, called Hamiltonian-assisted Metropolis sampling (HAMS), and develop two specific algorithms with appropriate tuning and preconditioning strategies. Our HAMS algorithms are designed to simultaneously achieve two distinctive properties, while using an augmented target density with momentum as an auxiliary variable. One is generalized detailed balance, which induces an irreversible exploration of the target. The other is a rejection-free property for a Gaussian target with a prespecified variance matrix. This property allows our preconditioned algorithms to perform satisfactorily with relatively large step sizes. Furthermore, we formulate a framework of generalized Metropolis–Hastings sampling, which not only highlights our construction of HAMS at a more abstract level, but also facilitates possible further development of irreversible MCMC algorithms. We present several numerical experiments, where the proposed algorithms consistently yield superior results compared with existing algorithms using the same preconditioning schemes. Journal: Journal of the American Statistical Association Pages: 1176-1194 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1982723 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1982723 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1176-1194 Template-Type: ReDIF-Article 1.0 Author-Name: Tamara Fernández Author-X-Name-First: Tamara Author-X-Name-Last: Fernández Author-Name: Arthur Gretton Author-X-Name-First: Arthur Author-X-Name-Last: Gretton Author-Name: David Rindt Author-X-Name-First: David Author-X-Name-Last: Rindt Author-Name: Dino Sejdinovic Author-X-Name-First: Dino Author-X-Name-Last: Sejdinovic Title: A Kernel Log-Rank Test of Independence for Right-Censored Data Abstract: We introduce a general nonparametric independence test between right-censored survival times and covariates, which may be multivariate. Our test statistic has a dual interpretation, first in terms of the supremum of a potentially infinite collection of weight-indexed log-rank tests, with weight functions belonging to a reproducing kernel Hilbert space (RKHS) of functions; and second, as the norm of the difference of embeddings of certain finite measures into the RKHS, similar to the Hilbert–Schmidt Independence Criterion (HSIC) test statistic. We study the asymptotic properties of the test, finding sufficient conditions to ensure our test correctly rejects the null hypothesis under any alternative.
The test statistic can be computed straightforwardly, and the rejection threshold is obtained via an asymptotically consistent wild bootstrap procedure. Extensive investigations on both simulated and real data suggest that our testing procedure generally performs better than competing approaches in detecting complex nonlinear dependence. Journal: Journal of the American Statistical Association Pages: 925-936 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1961784 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1961784 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:925-936 Template-Type: ReDIF-Article 1.0 Author-Name: Susan S. Ellenberg Author-X-Name-First: Susan S. Author-X-Name-Last: Ellenberg Title: Statistical Thinking in Clinical Trials Journal: Journal of the American Statistical Association Pages: 1448-1449 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2023.2183001 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2183001 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1448-1449 Template-Type: ReDIF-Article 1.0 Author-Name: Mehdi Dagdoug Author-X-Name-First: Mehdi Author-X-Name-Last: Dagdoug Author-Name: Camelia Goga Author-X-Name-First: Camelia Author-X-Name-Last: Goga Author-Name: David Haziza Author-X-Name-First: David Author-X-Name-Last: Haziza Title: Model-Assisted Estimation Through Random Forests in Finite Population Sampling Abstract: In surveys, the interest lies in estimating finite population parameters such as population totals and means. In most surveys, some auxiliary information is available at the estimation stage. This information may be incorporated in the estimation procedures to increase their precision. In this article, we use random forests (RFs) to estimate the functional relationship between the survey variable and the auxiliary variables. In recent years, RFs have become attractive as National Statistical Offices now have access to a variety of data sources, potentially exhibiting a large number of observations on a large number of variables. We establish the theoretical properties of model-assisted procedures based on RFs and derive corresponding variance estimators. A model-calibration procedure for handling multiple survey variables is also discussed. The results of a simulation study suggest that the proposed point and variance estimation procedures perform well in terms of bias, efficiency, and coverage of normal-based confidence intervals, in a wide variety of settings. Finally, we apply the proposed methods using data on radio audiences collected by Médiamétrie, a French audience measurement company. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1234-1251 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1987250 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1987250 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1234-1251 Template-Type: ReDIF-Article 1.0 Author-Name: Florian Gunsilius Author-X-Name-First: Florian Author-X-Name-Last: Gunsilius Author-Name: Susanne Schennach Author-X-Name-First: Susanne Author-X-Name-Last: Schennach Title: Independent Nonlinear Component Analysis Abstract: The idea of summarizing the information contained in a large number of variables by a small number of “factors” or “principal components” has been broadly adopted in statistics. This article introduces a generalization of the widely used principal component analysis (PCA) to nonlinear settings, thus providing a new tool for dimension reduction and exploratory data analysis or representation. The distinguishing features of the method include (i) the ability to always deliver truly independent (instead of merely uncorrelated) factors; (ii) the use of optimal transport theory and Brenier maps to obtain a robust and efficient computational algorithm; (iii) the use of a new multivariate additive entropy decomposition to determine the most informative principal nonlinear components; and (iv) formally nesting PCA as a special case for linear Gaussian factor models. We illustrate the method’s effectiveness in an application to the prediction of excess bond returns from a large number of macro factors. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1305-1318 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1990768 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1990768 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1305-1318 Template-Type: ReDIF-Article 1.0 Author-Name: Fabian Mies Author-X-Name-First: Fabian Author-X-Name-Last: Mies Title: Functional Estimation and Change Detection for Nonstationary Time Series Abstract: Tests for structural breaks in time series should ideally be sensitive to breaks in the parameter of interest, while being robust to nuisance changes. Statistical analysis thus needs to allow for some form of nonstationarity under the null hypothesis of no change. In this article, estimators for integrated parameters of locally stationary time series are constructed and a corresponding functional central limit theorem is established, enabling change-point inference for a broad class of parameters under mild assumptions. The proposed framework covers all parameters which may be expressed as nonlinear functions of moments, for example, kurtosis, autocorrelation, and coefficients in a linear regression model. To perform feasible inference based on the derived limit distribution, a bootstrap variant is proposed and its consistency is established. The methodology is illustrated by means of a simulation study and by an application to high-frequency asset prices. Journal: Journal of the American Statistical Association Pages: 1011-1022 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1969239 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1969239 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1011-1022 Template-Type: ReDIF-Article 1.0 Author-Name: Yinghao Pan Author-X-Name-First: Yinghao Author-X-Name-Last: Pan Author-Name: Eric B. Laber Author-X-Name-First: Eric B. Author-X-Name-Last: Laber Author-Name: Maureen A. Smith Author-X-Name-First: Maureen A. Author-X-Name-Last: Smith Author-Name: Ying-Qi Zhao Author-X-Name-First: Ying-Qi Author-X-Name-Last: Zhao Title: Reinforced Risk Prediction With Budget Constraint Using Irregularly Measured Data From Electronic Health Records Abstract: Uncontrolled glycated hemoglobin (HbA1c) levels are associated with adverse events among complex diabetic patients. These adverse events present serious health risks to affected patients and are associated with significant financial costs. Thus, a high-quality predictive model that could identify high-risk patients so as to inform preventative treatment has the potential to improve patient outcomes while reducing healthcare costs. Because the biomarker information needed to predict risk is costly and burdensome, it is desirable that such a model collect only as much information as is needed on each patient so as to render an accurate prediction. We propose a sequential predictive model that uses accumulating patient longitudinal data to classify patients as high-risk, low-risk, or uncertain. Patients classified as high-risk are then recommended to receive preventative treatment and those classified as low-risk are recommended to receive standard care. Patients classified as uncertain are monitored until a high-risk or low-risk determination is made. We construct the model using claims and enrollment files from Medicare, linked with patient electronic health records (EHR) data. The proposed model uses functional principal components to accommodate noisy longitudinal data and weighting to deal with missingness and sampling bias. The proposed method demonstrates higher predictive accuracy and lower cost than competing methods in a series of simulation experiments and application to data on complex patients with diabetes. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1090-1101 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1978467 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1978467 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1090-1101 Template-Type: ReDIF-Article 1.0 Author-Name: Zhimei Ren Author-X-Name-First: Zhimei Author-X-Name-Last: Ren Author-Name: Yuting Wei Author-X-Name-First: Yuting Author-X-Name-Last: Wei Author-Name: Emmanuel Candès Author-X-Name-First: Emmanuel Author-X-Name-Last: Candès Title: Derandomizing Knockoffs Abstract: Model-X knockoffs is a general procedure that can leverage any feature importance measure to produce a variable selection algorithm, which discovers true effects while rigorously controlling the number or fraction of false positives. Model-X knockoffs is a randomized procedure which relies on the one-time construction of synthetic (random) variables.
This article introduces a derandomization method by aggregating the selection results across multiple runs of the knockoffs algorithm. The derandomization step is designed to be flexible and can be adapted to any variable selection base procedure to yield stable decisions without compromising statistical power. When applied to the base procedure of Janson and Su, we prove that derandomized knockoffs controls both the per-family error rate (PFER) and the k-familywise error rate (k-FWER). Furthermore, we carry out extensive numerical studies demonstrating tight Type I error control and markedly enhanced power when compared with alternative variable selection algorithms. Finally, we apply our approach to multistage genome-wide association studies of prostate cancer and report locations on the genome that are significantly associated with the disease. When cross-referenced with other studies, we find that the reported associations have been replicated. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 948-958 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1962720 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1962720 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:948-958 Template-Type: ReDIF-Article 1.0 Author-Name: Niccolò Anceschi Author-X-Name-First: Niccolò Author-X-Name-Last: Anceschi Author-Name: Augusto Fasano Author-X-Name-First: Augusto Author-X-Name-Last: Fasano Author-Name: Daniele Durante Author-X-Name-First: Daniele Author-X-Name-Last: Durante Author-Name: Giacomo Zanella Author-X-Name-First: Giacomo Author-X-Name-Last: Zanella Title: Bayesian Conjugacy in Probit, Tobit, Multinomial Probit and Extensions: A Review and New Results Abstract: A broad class of models that routinely appear in several fields can be expressed as partially or fully discretized Gaussian linear regressions. Besides including classical Gaussian response settings, this class also encompasses probit, multinomial probit and tobit regression, among others, thereby yielding one of the most widely implemented families of models in routine applications. The relevance of such representations has stimulated decades of research in the Bayesian field, mostly motivated by the fact that, unlike for Gaussian linear regression, the posterior distribution induced by such models does not seem to belong to a known class under the commonly assumed Gaussian priors for the coefficients. This has motivated several solutions for posterior inference relying either on sampling-based strategies or on deterministic approximations that, however, still experience computational and accuracy issues, especially in high dimensions. The scope of this article is to review, unify and extend recent advances in Bayesian inference and computation for this core class of models. To address such a goal, we prove that the likelihoods induced by these formulations share a common analytical structure implying conjugacy with a broad class of distributions, namely the unified skew-normal (SUN) distributions, which generalize Gaussians to include skewness.
This result unifies and extends recent conjugacy properties for specific models within the class analyzed, and opens new avenues for improved posterior inference, under a broader class of formulations and priors, via novel closed-form expressions, i.i.d. samplers from the exact SUN posteriors, and more accurate and scalable approximations from variational Bayes and expectation propagation. Such advantages are illustrated in simulations and are expected to facilitate the routine use of these core Bayesian models, while providing novel frameworks for studying theoretical properties and developing future extensions. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1451-1469 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2023.2169150 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2169150 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1451-1469 Template-Type: ReDIF-Article 1.0 Author-Name: Cecilia Balocchi Author-X-Name-First: Cecilia Author-X-Name-Last: Balocchi Author-Name: Sameer K. Deshpande Author-X-Name-First: Sameer K. Author-X-Name-Last: Deshpande Author-Name: Edward I. George Author-X-Name-First: Edward I. Author-X-Name-Last: George Author-Name: Shane T. Jensen Author-X-Name-First: Shane T. Author-X-Name-Last: Jensen Title: Crime in Philadelphia: Bayesian Clustering with Particle Optimization Abstract: Accurate estimation of the change in crime over time is a critical first step toward better understanding of public safety in large urban environments. Bayesian hierarchical modeling is a natural way to study spatial variation in urban crime dynamics at the neighborhood level, since it facilitates principled “sharing of information” between spatially adjacent neighborhoods. Typically, however, cities contain many physical and social boundaries that may manifest as spatial discontinuities in crime patterns. In this situation, standard prior choices often yield overly smooth parameter estimates, which can ultimately produce miscalibrated forecasts. To prevent potential over-smoothing, we introduce a prior that partitions the set of neighborhoods into several clusters and encourages spatial smoothness within each cluster. In terms of model implementation, conventional stochastic search techniques are computationally prohibitive, as they must traverse a combinatorially vast space of partitions. We introduce an ensemble optimization procedure that simultaneously identifies several high probability partitions by solving one optimization problem using a new local search strategy. We then use the identified partitions to estimate crime trends in Philadelphia between 2006 and 2017. On simulated and real data, our proposed method demonstrates good estimation and partition selection performance. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 818-829 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2022.2156348 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2156348 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:818-829 Template-Type: ReDIF-Article 1.0 Author-Name: David J. Edwards Author-X-Name-First: David J. Author-X-Name-Last: Edwards Author-Name: Robert W. Mee Author-X-Name-First: Robert W. Author-X-Name-Last: Mee Title: Structure of Nonregular Two-Level Designs Abstract: Two-level fractional factorial designs are often used in screening scenarios to identify active factors. This article investigates the block diagonal structure of the information matrix of nonregular two-level designs. This structure is appealing since estimates of parameters belonging to different diagonal submatrices are uncorrelated. As such, the covariance matrix of the least squares estimates is simplified and the number of linear dependencies is reduced. We connect the block diagonal information matrix to the parallel flats design (PFD) literature and gain insights into the structure of what is estimable and/or aliased using the concept of minimal dependent sets. We show how to determine the number of parallel flats for any given design, and how to construct a design with a specified number of parallel flats. The usefulness of our construction method is illustrated by producing designs for estimation of the two-factor interaction model with three or more parallel flats. We also provide a fuller understanding of recently proposed group orthogonal supersaturated designs. Benefits of PFDs for analysis, including bias containment, are also discussed. Journal: Journal of the American Statistical Association Pages: 1222-1233 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1984927 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1984927 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1222-1233 Template-Type: ReDIF-Article 1.0 Author-Name: Zhe Fei Author-X-Name-First: Zhe Author-X-Name-Last: Fei Author-Name: Qi Zheng Author-X-Name-First: Qi Author-X-Name-Last: Zheng Author-Name: Hyokyoung G. Hong Author-X-Name-First: Hyokyoung G. Author-X-Name-Last: Hong Author-Name: Yi Li Author-X-Name-First: Yi Author-X-Name-Last: Li Title: Inference for High-Dimensional Censored Quantile Regression Abstract: With the availability of high-dimensional genetic biomarkers, it is of interest to identify heterogeneous effects of these predictors on patients’ survival, along with proper statistical inference. Censored quantile regression has emerged as a powerful tool for detecting heterogeneous effects of covariates on survival outcomes. To our knowledge, there is little work on drawing inferences about the effects of high-dimensional predictors in censored quantile regression (CQR). This article proposes a novel procedure to draw inference on all predictors within the framework of global CQR, which investigates covariate-response associations over an interval of quantile levels, instead of a few discrete values. The proposed estimator combines a sequence of low-dimensional model estimates that are based on multi-sample splittings and variable selection. We show that, under some regularity conditions, the estimator is consistent and asymptotically follows a Gaussian process indexed by the quantile level.
Simulation studies indicate that our procedure can properly quantify the uncertainty of the estimates in high-dimensional settings. We apply our method to analyze the heterogeneous effects of SNPs residing in lung cancer pathways on patients’ survival, using the Boston Lung Cancer Survival Cohort, a cancer epidemiology study on the molecular mechanism of lung cancer. Journal: Journal of the American Statistical Association Pages: 898-912 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1957900 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1957900 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:898-912 Template-Type: ReDIF-Article 1.0 Author-Name: Yaqing Chen Author-X-Name-First: Yaqing Author-X-Name-Last: Chen Author-Name: Zhenhua Lin Author-X-Name-First: Zhenhua Author-X-Name-Last: Lin Author-Name: Hans-Georg Müller Author-X-Name-First: Hans-Georg Author-X-Name-Last: Müller Title: Wasserstein Regression Abstract: The analysis of samples of random objects that do not lie in a vector space is gaining increasing attention in statistics. An important class of such object data is univariate probability measures defined on the real line. Adopting the Wasserstein metric, we develop a class of regression models for such data, where random distributions serve as predictors and the responses are either also distributions or scalars. To define this regression model, we use the geometry of tangent bundles of the space of random measures endowed with the Wasserstein metric for mapping distributions to tangent spaces. The proposed distribution-to-distribution regression model provides an extension of multivariate linear regression for Euclidean data and function-to-function regression for Hilbert space-valued data in functional data analysis. In simulations, it performs better than an alternative transformation approach where one maps distributions to a Hilbert space through the log quantile density transformation and then applies traditional functional regression. We derive asymptotic rates of convergence for the estimator of the regression operator and for predicted distributions and also study an extension to autoregressive models for distribution-valued time series. The proposed methods are illustrated with data on human mortality and distributional time series of house prices. Journal: Journal of the American Statistical Association Pages: 869-882 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1956937 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1956937 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:869-882 Template-Type: ReDIF-Article 1.0 Author-Name: Yunzhang Zhu Author-X-Name-First: Yunzhang Author-X-Name-Last: Zhu Author-Name: Xiaotong Shen Author-X-Name-First: Xiaotong Author-X-Name-Last: Shen Author-Name: Hui Jiang Author-X-Name-First: Hui Author-X-Name-Last: Jiang Author-Name: Wing Hung Wong Author-X-Name-First: Wing Hung Author-X-Name-Last: Wong Title: Collaborative Multilabel Classification Abstract: In multilabel classification, strong label dependence is present and can be exploited, particularly word-to-word dependence defined by semantic labels. In such a situation, we develop a collaborative-learning framework to predict class labels based on label-predictor pairs and label-only data. For example, in image categorization and recognition, language expressions describe the content of an image, together with a large number of words and phrases that have no associated images. This article proposes a new loss quantifying partial correctness for false positive and negative misclassifications due to label similarities. Given this loss, we develop the Bayes rule to capture label dependence by nonlinear classification. On this ground, we introduce a weighted random forest classifier for complete data and a stacking scheme for leveraging additional labels to enhance the performance of supervised learning based on label-predictor pairs. Importantly, we decompose multilabel classification into a sequence of independent learning tasks, based on which the computational complexity of our classifier becomes linear in the number of labels. Compared to existing classifiers without label-only data, the proposed classifier enjoys the computational benefit while enabling the detection of novel labels absent from training by exploring label dependence and leveraging label-only data for higher accuracy. Theoretically, we show that the proposed method reconstructs the Bayes performance consistently, achieving the desired learning accuracy. Numerically, we demonstrate that the proposed method compares favorably in terms of the proposed and Hamming losses against binary relevance and a regularized Ising classifier modeling conditional label dependence. Indeed, leveraging additional labels tends to improve the supervised performance, especially when the training sample is not very large, as in semisupervised learning. Finally, we demonstrate the utility of the proposed approach on the Microsoft COCO object detection challenge, PASCAL visual object classes challenge 2007, and Mediamill benchmark. Journal: Journal of the American Statistical Association Pages: 913-924 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1961783 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1961783 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:913-924 Template-Type: ReDIF-Article 1.0 Author-Name: Elynn Y. Chen Author-X-Name-First: Elynn Y.
Author-X-Name-Last: Chen Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Title: Statistical Inference for High-Dimensional Matrix-Variate Factor Models Abstract: This article considers the estimation and inference of the low-rank components in high-dimensional matrix-variate factor models, where each dimension of the matrix-variates (p × q) is comparable to or greater than the number of observations (T). We propose an estimation method called α-PCA that preserves the matrix structure and aggregates mean and contemporary covariance through a hyper-parameter α. We develop an inferential theory, establishing consistency, the rate of convergence, and the limiting distributions, under general conditions that allow for correlations across time, rows, or columns of the noise. We present both theoretical and empirical methods for choosing the best α, depending on the use-case criteria. Simulation results demonstrate the adequacy of the asymptotic results in approximating the finite sample properties. The α-PCA compares favorably with existing methods. Finally, we illustrate its applications with a real numeric dataset and two real image datasets. In all applications, the proposed estimation procedure outperforms previous methods in the power of variance explanation using out-of-sample 10-fold cross-validation. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1038-1055 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1970569 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1970569 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1038-1055 Template-Type: ReDIF-Article 1.0 Author-Name: T. Tony Cai Author-X-Name-First: T. Tony Author-X-Name-Last: Cai Author-Name: Zijian Guo Author-X-Name-First: Zijian Author-X-Name-Last: Guo Author-Name: Rong Ma Author-X-Name-First: Rong Author-X-Name-Last: Ma Title: Statistical Inference for High-Dimensional Generalized Linear Models With Binary Outcomes Abstract: This article develops a unified statistical inference framework for high-dimensional binary generalized linear models (GLMs) with general link functions. Both unknown and known design distribution settings are considered. A two-step weighted bias-correction method is proposed for constructing confidence intervals (CIs) and simultaneous hypothesis tests for individual components of the regression vector. A minimax lower bound for the expected length is established and the proposed CIs are shown to be rate-optimal up to a logarithmic factor. The numerical performance of the proposed procedure is demonstrated through simulation studies and an analysis of a single cell RNA-seq dataset, which yields interesting biological insights that integrate well into the current literature on the cellular immune response mechanisms as characterized by single-cell transcriptomics. The theoretical analysis provides important insights on the adaptivity of optimal CIs with respect to the sparsity of the regression vector. New lower bound techniques are introduced and they can be of independent interest to solve other inference problems in high-dimensional binary GLMs.
Journal: Journal of the American Statistical Association Pages: 1319-1332 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1990769 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1990769 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1319-1332 Template-Type: ReDIF-Article 1.0 Author-Name: Yu-Ting Chen Author-X-Name-First: Yu-Ting Author-X-Name-Last: Chen Author-Name: Jeng-Min Chiou Author-X-Name-First: Jeng-Min Author-X-Name-Last: Chiou Author-Name: Tzee-Ming Huang Author-X-Name-First: Tzee-Ming Author-X-Name-Last: Huang Title: Greedy Segmentation for a Functional Data Sequence Abstract: We present a new approach known as greedy segmentation (GS) to identify multiple changepoints for a functional data sequence. The proposed multiple changepoint detection criterion links detectability with the projection onto a suitably chosen subspace and the changepoint locations. The changepoint estimator identifies the true changepoints for any predetermined number of changepoint candidates, whether over-reported or under-reported. This theoretical finding supports the proposed GS estimator, which can be efficiently obtained in a greedy manner. The GS estimator’s consistency holds without being restricted to the conventional at-most-one-changepoint condition, and it is robust to the relative positions of the changepoints. Based on the GS estimator, the test statistic’s asymptotic distribution leads to the novel GS algorithm, which identifies the number and locations of changepoints. Using intensive simulation studies, we compare the finite sample performance of the GS approach with other competing methods. We also apply our method to temporal changepoint detection in weather datasets. Journal: Journal of the American Statistical Association Pages: 959-971 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1963261 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1963261 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:959-971 Template-Type: ReDIF-Article 1.0 Author-Name: Chan Park Author-X-Name-First: Chan Author-X-Name-Last: Park Author-Name: Hyunseung Kang Author-X-Name-First: Hyunseung Author-X-Name-Last: Kang Title: Assumption-Lean Analysis of Cluster Randomized Trials in Infectious Diseases for Intent-to-Treat Effects and Network Effects Abstract: Cluster randomized trials (CRTs) are a popular design to study the effect of interventions in infectious disease settings. However, standard analysis of CRTs primarily relies on strong parametric methods, usually mixed-effect models to account for the clustering structure, and focuses on the overall intent-to-treat (ITT) effect to evaluate effectiveness. The article presents two assumption-lean methods to analyze two types of effects in CRTs: ITT effects and network effects among well-known compliance groups. For the ITT effects, we study the overall and the heterogeneous ITT effects among the observed covariates, where we do not impose parametric models or asymptotic restrictions on cluster size.
For the network effects among compliance groups, we propose a new bound-based method that uses pretreatment covariates, classification algorithms, and a linear program to obtain sharp bounds. A key feature of our method is that the bounds can become narrower as the classification algorithm improves, and the method may also be useful for studies of partial identification with instrumental variables. We conclude by reanalyzing a CRT studying the effect of face masks and hand sanitizers on the transmission of 2008 interpandemic influenza in Hong Kong. Journal: Journal of the American Statistical Association Pages: 1195-1206 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1983437 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1983437 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1195-1206 Template-Type: ReDIF-Article 1.0 Author-Name: Yash Deshpande Author-X-Name-First: Yash Author-X-Name-Last: Deshpande Author-Name: Adel Javanmard Author-X-Name-First: Adel Author-X-Name-Last: Javanmard Author-Name: Mohammad Mehrabi Author-X-Name-First: Mohammad Author-X-Name-Last: Mehrabi Title: Online Debiasing for Adaptively Collected High-Dimensional Data With Applications to Time Series Analysis Abstract: Adaptive collection of data is commonplace in applications throughout science and engineering. From the point of view of statistical inference, however, adaptive data collection induces memory and correlation in the samples, and poses significant challenges. We consider high-dimensional linear regression, where the samples are collected adaptively, and the sample size n can be smaller than p, the number of covariates. In this setting, there are two distinct sources of bias: the first due to regularization imposed for consistent estimation, for example, using the LASSO, and the second due to adaptivity in collecting the samples. We propose “online debiasing,” a general procedure for estimators such as the LASSO, which addresses both sources of bias. In two concrete contexts, (i) time series analysis and (ii) batched data collection, we demonstrate that online debiasing optimally debiases the LASSO estimate when the underlying parameter θ0 has sparsity of order o(n/log p). In this regime, the debiased estimator can be used to compute p-values and confidence intervals of optimal size. Journal: Journal of the American Statistical Association Pages: 1126-1139 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1979011 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1979011 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1126-1139 Template-Type: ReDIF-Article 1.0 Author-Name: Aaron J. Molstad Author-X-Name-First: Aaron J. Author-X-Name-Last: Molstad Author-Name: Adam J. Rothman Author-X-Name-First: Adam J. Author-X-Name-Last: Rothman Title: A Likelihood-Based Approach for Multivariate Categorical Response Regression in High Dimensions Abstract: We propose a penalized likelihood method to fit the bivariate categorical response regression model.
Our method allows practitioners to estimate which predictors are irrelevant, which predictors only affect the marginal distributions of the bivariate response, and which predictors affect both the marginal distributions and log odds ratios. To compute our estimator, we propose an efficient algorithm which we extend to settings where some subjects have only one response variable measured, that is, a semi-supervised setting. We derive an asymptotic error bound which illustrates the performance of our estimator in high-dimensional settings. Generalizations to the multivariate categorical response regression model are proposed. Finally, simulation studies and an application in pan-cancer risk prediction demonstrate the usefulness of our method in terms of interpretability and prediction accuracy. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1402-1414 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1999819 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1999819 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1402-1414 Template-Type: ReDIF-Article 1.0 Author-Name: Edward L. Ionides Author-X-Name-First: Edward L. Author-X-Name-Last: Ionides Author-Name: Kidus Asfaw Author-X-Name-First: Kidus Author-X-Name-Last: Asfaw Author-Name: Joonha Park Author-X-Name-First: Joonha Author-X-Name-Last: Park Author-Name: Aaron A. King Author-X-Name-First: Aaron A. Author-X-Name-Last: King Title: Bagged Filters for Partially Observed Interacting Systems Abstract: Bagging (i.e., bootstrap aggregating) involves combining an ensemble of bootstrap estimators. We consider bagging for inference from noisy or incomplete measurements on a collection of interacting stochastic dynamic systems. Each system is called a unit, and each unit is associated with a spatial location. A motivating example arises in epidemiology, where each unit is a city: the majority of transmission occurs within a city, with smaller yet epidemiologically important interactions arising from disease transmission between cities. Monte Carlo filtering methods used for inference on nonlinear non-Gaussian systems can suffer from a curse of dimensionality (COD) as the number of units increases. We introduce bagged filter (BF) methodology which combines an ensemble of Monte Carlo filters, using spatiotemporally localized weights to select successful filters at each unit and time. We obtain conditions under which likelihood evaluation using a BF algorithm can beat a COD, and we demonstrate applicability even when these conditions do not hold. BF can outperform an ensemble Kalman filter on a coupled population dynamics model describing infectious disease transmission. A block particle filter (BPF) also performs well on this task, though the bagged filter respects smoothness and conservation laws that a BPF can violate. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1078-1089 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1974867 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1974867 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1078-1089 Template-Type: ReDIF-Article 1.0 Author-Name: Francesca Gasperoni Author-X-Name-First: Francesca Author-X-Name-Last: Gasperoni Author-Name: Alessandra Luati Author-X-Name-First: Alessandra Author-X-Name-Last: Luati Author-Name: Lucia Paci Author-X-Name-First: Lucia Author-X-Name-Last: Paci Author-Name: Enzo D’Innocenzo Author-X-Name-First: Enzo Author-X-Name-Last: D’Innocenzo Title: Score-Driven Modeling of Spatio-Temporal Data Abstract: A simultaneous autoregressive score-driven model with autoregressive disturbances is developed for spatio-temporal data that may exhibit heavy tails. The model specification rests on a signal plus noise decomposition of a spatially filtered process, where the signal can be approximated by a nonlinear function of the past variables and a set of explanatory variables, while the noise follows a multivariate Student-t distribution. The key feature of the model is that the dynamics of the space-time varying signal are driven by the score of the conditional likelihood function. When the distribution is heavy-tailed, the score provides a robust update of the space-time varying location. Consistency and asymptotic normality of maximum likelihood estimators are derived along with the stochastic properties of the model. The motivating application of the proposed model comes from brain scans recorded through functional magnetic resonance imaging when subjects are at rest and not expected to react to any controlled stimulus. We identify spontaneous activations in brain regions as extreme values of a possibly heavy-tailed distribution, by accounting for spatial and temporal dependence. Journal: Journal of the American Statistical Association Pages: 1066-1077 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1970571 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1970571 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1066-1077 Template-Type: ReDIF-Article 1.0 Author-Name: Danielle C. Tucker Author-X-Name-First: Danielle C. Author-X-Name-Last: Tucker Author-Name: Yichao Wu Author-X-Name-First: Yichao Author-X-Name-Last: Wu Author-Name: Hans-Georg Müller Author-X-Name-First: Hans-Georg Author-X-Name-Last: Müller Title: Variable Selection for Global Fréchet Regression Abstract: Global Fréchet regression is an extension of linear regression to cover more general types of responses, such as distributions, networks, and manifolds, which are becoming more prevalent. In such models, predictors are Euclidean while responses are metric space valued. Predictor selection is of major relevance for regression modeling in the presence of multiple predictors but has not yet been addressed for Fréchet regression. Due to the metric space-valued nature of the responses, Fréchet regression models do not feature model parameters, and this lack of parameters makes it a major challenge to extend existing variable selection methods for linear regression to global Fréchet regression. In this work, we address this challenge and propose a novel variable selection method with good practical performance.
We provide theoretical support and demonstrate that the proposed variable selection method achieves selection consistency. We also explore the finite sample performance of the proposed method with numerical examples and data illustrations. Journal: Journal of the American Statistical Association Pages: 1023-1037 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1969240 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1969240 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1023-1037 Template-Type: ReDIF-Article 1.0 Author-Name: Fan Xia Author-X-Name-First: Fan Author-X-Name-Last: Xia Author-Name: Kwun Chuen Gary Chan Author-X-Name-First: Kwun Chuen Gary Author-X-Name-Last: Chan Title: Identification, Semiparametric Efficiency, and Quadruply Robust Estimation in Mediation Analysis with Treatment-Induced Confounding Abstract: Natural mediation effects are often of interest when the goal is to understand a causal mechanism. However, most existing methods and their identification assumptions preclude treatment-induced confounders often present in practice. To address this fundamental limitation, we provide a set of assumptions that identify the natural direct effect in the presence of treatment-induced confounders. Even when some of those assumptions are violated, the estimand still has an interventional direct effect interpretation. We derive the semiparametric efficiency bound for the estimand, which, unlike usual expressions, contains conditional densities that are variationally dependent. We consider a reparameterization and propose a quadruply robust estimator that remains consistent under four types of possible misspecification and is also locally semiparametric efficient. We use simulation studies to demonstrate the proposed method and study an application to the 2017 Natality data to investigate the effect of prenatal care on preterm birth mediated by preeclampsia with smoking status during pregnancy being a potential treatment-induced confounder. Journal: Journal of the American Statistical Association Pages: 1272-1281 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1990765 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1990765 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1272-1281 Template-Type: ReDIF-Article 1.0 Author-Name: Rui Tuo Author-X-Name-First: Rui Author-X-Name-Last: Tuo Author-Name: Shiyuan He Author-X-Name-First: Shiyuan Author-X-Name-Last: He Author-Name: Arash Pourhabib Author-X-Name-First: Arash Author-X-Name-Last: Pourhabib Author-Name: Yu Ding Author-X-Name-First: Yu Author-X-Name-Last: Ding Author-Name: Jianhua Z. Huang Author-X-Name-First: Jianhua Z. Author-X-Name-Last: Huang Title: A Reproducing Kernel Hilbert Space Approach to Functional Calibration of Computer Models Abstract: This article develops a frequentist solution to the functional calibration problem, where the value of a calibration parameter in a computer model is allowed to vary with the value of control variables in the physical system.
The need for functional calibration is motivated by engineering applications where using a constant calibration parameter results in a significant mismatch between outputs from the computer model and the physical experiment. Reproducing kernel Hilbert spaces (RKHS) are used to model the optimal calibration function, defined as the functional relationship between the calibration parameter and control variables that gives the best prediction. This optimal calibration function is estimated through penalized least squares with an RKHS-norm penalty and using physical data. An uncertainty quantification procedure is also developed for such estimates. Theoretical guarantees of the proposed method are provided in terms of prediction consistency and consistency of estimating the optimal calibration function. The proposed method is tested using both real and synthetic data and exhibits more robust performance in prediction and uncertainty quantification than the existing parametric functional calibration method and a state-of-the-art Bayesian method. Journal: Journal of the American Statistical Association Pages: 883-897 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1956938 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1956938 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:883-897 Template-Type: ReDIF-Article 1.0 Author-Name: Nikolaos Ignatiadis Author-X-Name-First: Nikolaos Author-X-Name-Last: Ignatiadis Author-Name: Sujayam Saha Author-X-Name-First: Sujayam Author-X-Name-Last: Saha Author-Name: Dennis L. Sun Author-X-Name-First: Dennis L. Author-X-Name-Last: Sun Author-Name: Omkar Muralidharan Author-X-Name-First: Omkar Author-X-Name-Last: Muralidharan Title: Empirical Bayes Mean Estimation With Nonparametric Errors Via Order Statistic Regression on Replicated Data Abstract: We study empirical Bayes estimation of the effect sizes of N units from K noisy observations on each unit. We show that it is possible to achieve near-Bayes optimal mean squared error, without any assumptions or knowledge about the effect size distribution or the noise. The noise distribution can be heteroscedastic and vary arbitrarily from unit to unit. Our proposal, which we call Aurora, leverages the replication inherent in the K observations per unit and recasts the effect size estimation problem as a general regression problem. Aurora with linear regression provably matches the performance of a wide array of estimators including the sample mean, the trimmed mean, the sample median, as well as James-Stein shrunk versions thereof. Aurora automates effect size estimation for Internet-scale datasets, as we demonstrate on data from a large technology firm. Journal: Journal of the American Statistical Association Pages: 987-999 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1967164 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1967164 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:987-999 Template-Type: ReDIF-Article 1.0 Author-Name: Chaonan Jiang Author-X-Name-First: Chaonan Author-X-Name-Last: Jiang Author-Name: Davide La Vecchia Author-X-Name-First: Davide Author-X-Name-Last: La Vecchia Author-Name: Elvezio Ronchetti Author-X-Name-First: Elvezio Author-X-Name-Last: Ronchetti Author-Name: Olivier Scaillet Author-X-Name-First: Olivier Author-X-Name-Last: Scaillet Title: Saddlepoint Approximations for Spatial Panel Data Models Abstract: We develop new higher-order asymptotic techniques for the Gaussian maximum likelihood estimator in a spatial panel data model, with fixed effects, time-varying covariates, and spatially correlated errors. Our saddlepoint density and tail area approximation feature a relative error of order O(1/(n(T−1))), with n being the cross-sectional dimension and T the time-series dimension. The main theoretical tool is the tilted-Edgeworth technique in a nonidentically distributed setting. The density approximation is always nonnegative, does not need resampling, and is accurate in the tails. Monte Carlo experiments on density approximation and testing in the presence of nuisance parameters illustrate the good performance of our approximation over first-order asymptotics and Edgeworth expansion. An empirical application to the investment–saving relationship in OECD (Organisation for Economic Co-operation and Development) countries shows disagreement between testing results based on the first-order asymptotics and saddlepoint techniques. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 1164-1175 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1981913 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1981913 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1164-1175 Template-Type: ReDIF-Article 1.0 Author-Name: Xinhe Wang Author-X-Name-First: Xinhe Author-X-Name-Last: Wang Author-Name: Tingyu Wang Author-X-Name-First: Tingyu Author-X-Name-Last: Wang Author-Name: Hanzhong Liu Author-X-Name-First: Hanzhong Author-X-Name-Last: Liu Title: Rerandomization in Stratified Randomized Experiments Abstract: Stratification and rerandomization are two well-known methods used in randomized experiments for balancing the baseline covariates. Renowned scholars in experimental design have recommended combining these two methods; however, limited studies have addressed the statistical properties of this combination. This article proposes two rerandomization methods to be used in stratified randomized experiments, based on the overall and stratum-specific Mahalanobis distances. The first method is applicable to nearly arbitrary numbers of strata, strata sizes, and stratum-specific proportions of the treated units. The second method, which is generally more efficient than the first method, is suitable for situations in which the number of strata is fixed with their sizes tending to infinity.
Under the randomization inference framework, we obtain the asymptotic distributions of estimators used in these methods and the formulas of variance reduction when compared to stratified randomization. Our analysis does not require any modeling assumption regarding the potential outcomes. Moreover, we provide asymptotically conservative variance estimators and confidence intervals for the average treatment effect. The advantages of the proposed methods are exhibited through an extensive simulation study and a real-data example. Journal: Journal of the American Statistical Association Pages: 1295-1304 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1990767 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1990767 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1295-1304 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1955691_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Iván Díaz Author-X-Name-First: Iván Author-X-Name-Last: Díaz Author-Name: Nicholas Williams Author-X-Name-First: Nicholas Author-X-Name-Last: Williams Author-Name: Katherine L. Hoffman Author-X-Name-First: Katherine L. Author-X-Name-Last: Hoffman Author-Name: Edward J. Schenck Author-X-Name-First: Edward J. Author-X-Name-Last: Schenck Title: Nonparametric Causal Effects Based on Longitudinal Modified Treatment Policies Abstract: Most causal inference methods consider counterfactual variables under interventions that set the exposure to a fixed value. With continuous or multi-valued treatments or exposures, such counterfactuals may be of little practical interest because no feasible intervention can be implemented that would bring them about. Longitudinal modified treatment policies (LMTPs) are a recently developed nonparametric alternative that yield effects of immediate practical relevance with an interpretation in terms of meaningful interventions such as reducing or increasing the exposure by a given amount. LMTPs also have the advantage that they can be designed to satisfy the positivity assumption required for causal inference. We present a novel sequential regression formula that identifies the LMTP causal effect, study properties of the LMTP statistical estimand such as the efficient influence function and the efficiency bound, and propose four different estimators. Two of our estimators are efficient, and one is sequentially doubly robust in the sense that it is consistent if, for each time point, either an outcome regression or a treatment mechanism is consistently estimated. We perform numerical studies of the estimators, and present the results of our motivating study on hypoxemia and mortality in intubated Intensive Care Unit (ICU) patients. Software implementing our methods is provided in the form of the open source R package lmtp freely available on GitHub (https://github.com/nt-williams/lmtp) and CRAN. Journal: Journal of the American Statistical Association Pages: 846-857 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1955691 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1955691 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:846-857 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1981337_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: B. Zhang Author-X-Name-First: B. Author-X-Name-Last: Zhang Author-Name: D. S. Small Author-X-Name-First: D. S. Author-X-Name-Last: Small Author-Name: K. B. Lasater Author-X-Name-First: K. B. Author-X-Name-Last: Lasater Author-Name: M. McHugh Author-X-Name-First: M. Author-X-Name-Last: McHugh Author-Name: J. H. Silber Author-X-Name-First: J. H. Author-X-Name-Last: Silber Author-Name: P. R. Rosenbaum Author-X-Name-First: P. R. Author-X-Name-Last: Rosenbaum Title: Matching One Sample According to Two Criteria in Observational Studies Abstract: Multivariate matching has two goals: (i) to construct treated and control groups that have similar distributions of observed covariates, and (ii) to produce matched pairs or sets that are homogeneous in a few key covariates. When there are only a few binary covariates, both goals may be achieved by matching exactly for these few covariates. Commonly, however, there are many covariates, so goals (i) and (ii) come apart, and must be achieved by different means. As is also true in a randomized experiment, similar distributions can be achieved for a high-dimensional covariate, but close pairs can be achieved for only a few covariates. We introduce a new polynomial-time method for achieving both goals that substantially generalizes several existing methods; in particular, it can minimize the earthmover distance between two marginal distributions. The method involves minimum cost flow optimization in a network built around a tripartite graph, unlike the usual network built around a bipartite graph. In the tripartite graph, treated subjects appear twice, on the far left and the far right, with controls sandwiched between them, and efforts to balance covariates are represented on the right, while efforts to find close individual pairs are represented on the left. In this way, the two efforts may be pursued simultaneously without conflict. The method is applied to our ongoing study in the Medicare population of the relationship between superior nursing and sepsis mortality. The match2C package in R implements the method. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1140-1151 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1981337 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1981337 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1140-1151 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1979010_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Jiawei Zhang Author-X-Name-First: Jiawei Author-X-Name-Last: Zhang Author-Name: Jie Ding Author-X-Name-First: Jie Author-X-Name-Last: Ding Author-Name: Yuhong Yang Author-X-Name-First: Yuhong Author-X-Name-Last: Yang Title: Is a Classification Procedure Good Enough?—A Goodness-of-Fit Assessment Tool for Classification Learning Abstract: In recent years, many nontraditional classification methods, such as random forest, boosting, and neural network, have been widely used in applications. Their performance is typically measured in terms of classification accuracy.
While the classification error rate and the like are important, they do not address a fundamental question: Is the classification method underfitted? To the best of our knowledge, there is no existing method that can assess the goodness of fit of a general classification procedure. Indeed, the lack of a parametric assumption makes it challenging to construct proper tests. To overcome this difficulty, we propose a methodology called BAGofT that splits the data into a training set and a validation set. First, the classification procedure to be assessed is applied to the training set, which is also used to adaptively find a data grouping that reveals the most severe regions of underfitting. Then, based on this grouping, we calculate a test statistic by comparing the estimated success probabilities and the actual observed responses from the validation set. The data splitting guarantees that the size of the test is controlled under the null hypothesis, and the power of the test goes to one as the sample size increases under the alternative hypothesis. For testing parametric classification models, the BAGofT has a broader scope than the existing methods since it is not restricted to specific parametric models (e.g., logistic regression). Extensive simulation studies show the utility of the BAGofT when assessing general classification procedures and its strengths over some existing methods when testing parametric classification models. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1115-1125 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1979010 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1979010 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1115-1125 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1978468_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Matthew Blackwell Author-X-Name-First: Matthew Author-X-Name-Last: Blackwell Author-Name: Nicole E. Pashley Author-X-Name-First: Nicole E. Author-X-Name-Last: Pashley Title: Noncompliance and Instrumental Variables for 2^K Factorial Experiments Abstract: Factorial experiments are widely used to assess the marginal, joint, and interactive effects of multiple concurrent factors. While a robust literature covers the design and analysis of these experiments, there is less work on how to handle treatment noncompliance in this setting. To fill this gap, we introduce a new methodology that uses the potential outcomes framework for analyzing 2^K factorial experiments with noncompliance on any number of factors. This framework builds on and extends the literature on both instrumental variables and factorial experiments in several ways. First, we define novel, complier-specific quantities of interest for this setting and show how to generalize key instrumental variables assumptions. Second, we show how partial compliance across factors gives researchers a choice over different types of compliers to target in estimation. Third, we show how to conduct inference for these new estimands from both the finite-population and superpopulation asymptotic perspectives. Finally, we illustrate these techniques by applying them to a field experiment on the effectiveness of different forms of get-out-the-vote canvassing. New easy-to-use, open-source software implements the methodology.
Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1102-1114 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1978468 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1978468 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1102-1114 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2183128_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Yang Ni Author-X-Name-First: Yang Author-X-Name-Last: Ni Title: Handbook of Bayesian Variable Selection Journal: Journal of the American Statistical Association Pages: 1449-1450 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2023.2183128 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2183128 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1449-1450 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1984926_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Zihao Yang Author-X-Name-First: Zihao Author-X-Name-Last: Yang Author-Name: Tianyi Qu Author-X-Name-First: Tianyi Author-X-Name-Last: Qu Author-Name: Xinran Li Author-X-Name-First: Xinran Author-X-Name-Last: Li Title: Rejective Sampling, Rerandomization, and Regression Adjustment in Survey Experiments Abstract: Classical randomized experiments, equipped with randomization-based inference, provide assumption-free inference for treatment effects. They have been the gold standard for drawing causal inference and provide excellent internal validity. However, they have also been criticized for questionable external validity, in the sense that the conclusion may not generalize well to a larger population. The randomized survey experiment is a design tool that can help mitigate this concern, by randomly selecting the experimental units from the target population of interest. However, as pointed out by Morgan and Rubin, chance imbalances often exist in covariate distributions between different treatment groups even under completely randomized experiments. Not surprisingly, such covariate imbalances also occur in randomized survey experiments. Furthermore, the covariate imbalances happen not only between different treatment groups, but also between the sampled experimental units and the overall population of interest. In this article, we propose a two-stage rerandomization design that can actively avoid undesirable covariate imbalances at both the sampling and treatment assignment stages. We further develop asymptotic theory for rerandomized survey experiments, demonstrating that rerandomization provides better covariate balance, more precise treatment effect estimators, and shorter large-sample confidence intervals. We also propose covariate adjustment to deal with remaining covariate imbalances after rerandomization, showing that it can further improve both the sampling and estimation precision. Our work allows a general relationship among covariates at the sampling, treatment assignment, and analysis stages, and generalizes both rerandomization in classical randomized experiments and rejective sampling in survey sampling.
Journal: Journal of the American Statistical Association Pages: 1207-1221 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1984926 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1984926 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1207-1221 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1999818_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Wei Liu Author-X-Name-First: Wei Author-X-Name-Last: Liu Author-Name: Huazhen Lin Author-X-Name-First: Huazhen Author-X-Name-Last: Lin Author-Name: Shurong Zheng Author-X-Name-First: Shurong Author-X-Name-Last: Zheng Author-Name: Jin Liu Author-X-Name-First: Jin Author-X-Name-Last: Liu Title: Generalized Factor Model for Ultra-High Dimensional Correlated Variables with Mixed Types Abstract: As high-dimensional data measured with mixed-type variables gradually become prevalent, it is particularly appealing to represent those mixed-type high-dimensional data using a much smaller set of so-called factors. Due to the limitation of the existing methods for factor analysis that deal with only continuous variables, in this article, we develop a generalized factor model, a corresponding algorithm and theory for ultra-high dimensional mixed types of variables where both the sample size n and variable dimension p could diverge to infinity. Specifically, to solve the computational problem arising from the non-linearity and mixed types, we develop a two-step algorithm so that each update can be carried out in parallel across variables and samples by using an existing package. Theoretically, we establish the rate of convergence for the estimators of factors and loadings in the presence of nonlinear structure accompanied with mixed-type variables when both n and p diverge to infinity. Moreover, since the correct specification of the number of factors is crucial to both the theoretical and the empirical validity of factor models, we also develop a criterion based on a penalized loss to consistently estimate the number of factors under the framework of a generalized factor model. To demonstrate the advantages of the proposed method over the existing ones, we conducted extensive simulation studies and also applied it to the analysis of the NFBC1966 dataset and a cardiac arrhythmia dataset, resulting in more predictive and interpretable estimators for loadings and factors than the existing factor model. Journal: Journal of the American Statistical Association Pages: 1385-1401 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1999818 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1999818 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1385-1401 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_1955690_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Xiwei Tang Author-X-Name-First: Xiwei Author-X-Name-Last: Tang Author-Name: Lexin Li Author-X-Name-First: Lexin Author-X-Name-Last: Li Title: Multivariate Temporal Point Process Regression Abstract: Point process modeling is gaining increasing attention, as point process type data are emerging in a large variety of scientific applications. 
In this article, motivated by a neuronal spike trains study, we propose a novel point process regression model, where both the response and the predictor can be a high-dimensional point process. We model the predictor effects through the conditional intensities using a set of basis transferring functions in a convolutional fashion. We organize the corresponding transferring coefficients in the form of a three-way tensor, then impose the low-rank, sparsity, and subgroup structures on this coefficient tensor. These structures help reduce the dimensionality, integrate information across different individual processes, and facilitate the interpretation. We develop a highly scalable optimization algorithm for parameter estimation. We derive the large sample error bound for the recovered coefficient tensor, and establish the subgroup identification consistency, while allowing the dimension of the multivariate point process to diverge. We demonstrate the efficacy of our method through both simulations and a cross-area neuronal spike trains analysis in a sensory cortex study. Journal: Journal of the American Statistical Association Pages: 830-845 Issue: 542 Volume: 118 Year: 2023 Month: 4 X-DOI: 10.1080/01621459.2021.1955690 File-URL: http://hdl.handle.net/10.1080/01621459.2021.1955690 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:830-845 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2224409_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Mark N. Harris Author-X-Name-First: Mark N. Author-X-Name-Last: Harris Title: Modern Applied Regressions: Bayesian and Frequentist Analysis of Categorical and Limited Response Variables with R and Stan Journal: Journal of the American Statistical Association Pages: 2209-2211 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2023.2224409 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2224409 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2209-2211 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2003201_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Leo L. Duan Author-X-Name-First: Leo L. Author-X-Name-Last: Duan Title: Transport Monte Carlo: High-Accuracy Posterior Approximation via Random Transport Abstract: In Bayesian applications, there is a huge interest in rapid and accurate estimation of the posterior distribution, particularly for high dimensional or hierarchical models. In this article, we propose to use optimization to solve for a joint distribution (random transport plan) between two random variables, θ from the posterior distribution and β from the simple multivariate uniform. Specifically, we obtain an approximate estimate of the conditional distribution Π(β|θ) as an infinite mixture of simple location-scale changes; applying Bayes’ theorem, Π(θ|β) can be sampled as one of the reversed transforms from the uniform, with the weight proportional to the posterior density/mass function. This produces independent random samples with high approximation accuracy, as well as nice theoretical guarantees. Our method shows compelling advantages in performance and accuracy, compared to the state-of-the-art Markov chain Monte Carlo and approximations such as variational Bayes and normalizing flow.
We illustrate this approach via several challenging applications, such as sampling from a multi-modal distribution, estimating sparse signals in high dimension, and soft-thresholding of a graph with a prior on the degrees. Supplementary materials for this article, including the source code and additional comparison with popular alternative algorithms, are available on the journal website. Journal: Journal of the American Statistical Association Pages: 1659-1670 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2003201 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2003201 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1659-1670 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2013851_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Xiaowu Dai Author-X-Name-First: Xiaowu Author-X-Name-Last: Dai Author-Name: Lexin Li Author-X-Name-First: Lexin Author-X-Name-Last: Li Title: Orthogonalized Kernel Debiased Machine Learning for Multimodal Data Analysis Abstract: Multimodal imaging has transformed neuroscience research. While it presents unprecedented opportunities, it also imposes serious challenges. Particularly, it is difficult to combine the merits of the interpretability attributed to a simple association model with the flexibility achieved by a highly adaptive nonlinear model. In this article, we propose an orthogonalized kernel debiased machine learning approach, which is built upon the Neyman orthogonality and a form of decomposition orthogonality, for multimodal data analysis. We target the setting that naturally arises in almost all multimodal studies, where there is a primary modality of interest, plus additional auxiliary modalities. We establish the root-N-consistency and asymptotic normality of the estimated primary parameter, the semi-parametric estimation efficiency, and the asymptotic validity of the confidence band of the predicted primary modality effect. Our proposal enjoys, to a good extent, both model interpretability and model flexibility. It is also considerably different from the existing statistical methods for multimodal data integration, as well as the orthogonality-based methods for high-dimensional inferences. We demonstrate the efficacy of our method through both simulations and an application to a multimodal neuroimaging study of Alzheimer’s disease. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1796-1810 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2013851 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2013851 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1796-1810 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2044333_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Sai Li Author-X-Name-First: Sai Author-X-Name-Last: Li Author-Name: T. Tony Cai Author-X-Name-First: T. Tony Author-X-Name-Last: Cai Author-Name: Hongzhe Li Author-X-Name-First: Hongzhe Author-X-Name-Last: Li Title: Transfer Learning in Large-Scale Gaussian Graphical Models with False Discovery Rate Control Abstract: Transfer learning for high-dimensional Gaussian graphical models (GGMs) is studied.
The target GGM is estimated by incorporating the data from similar and related auxiliary studies, where the similarity between the target graph and each auxiliary graph is characterized by the sparsity of a divergence matrix. An estimation algorithm, Trans-CLIME, is proposed and shown to attain a faster convergence rate than the minimax rate in the single-task setting. Furthermore, we introduce a universal debiasing method that can be coupled with a range of initial graph estimators and can be analytically computed in one step. A debiased Trans-CLIME estimator is then constructed and is shown to be element-wise asymptotically normal. This fact is used to construct a multiple testing procedure for edge detection with false discovery rate control. The proposed estimation and multiple testing procedures demonstrate superior numerical performance in simulations and are applied to infer the gene networks in a target brain tissue by leveraging the gene expressions from multiple other brain tissues. A significant decrease in prediction errors and a significant increase in power for link detection are observed. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2171-2183 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2022.2044333 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2044333 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2171-2183 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2023550_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Yen-Chi Chen Author-X-Name-First: Yen-Chi Author-X-Name-Last: Chen Title: Statistical Inference with Local Optima Abstract: We study the statistical properties of an estimator derived by applying a gradient ascent method with multiple initializations to a multi-modal likelihood function. We derive the population quantity that is the target of this estimator and study the properties of confidence intervals (CIs) constructed from asymptotic normality and the bootstrap approach. In particular, we analyze the coverage deficiency due to a finite number of random initializations. We also investigate the CIs by inverting the likelihood ratio test, the score test, and the Wald test, and we show that the resulting CIs may be very different. We propose a two-sample test procedure even when the maximum likelihood estimator is intractable. In addition, we analyze the performance of the EM algorithm under random initializations and derive the coverage of a CI with a finite number of initializations. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1940-1952 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2023550 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2023550 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1940-1952 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2025815_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Lizhen Nie Author-X-Name-First: Lizhen Author-X-Name-Last: Nie Author-Name: Veronika Ročková Author-X-Name-First: Veronika Author-X-Name-Last: Ročková Title: Bayesian Bootstrap Spike-and-Slab LASSO Abstract: The impracticality of posterior sampling has prevented the widespread adoption of spike-and-slab priors in high-dimensional applications. To alleviate the computational burden, optimization strategies have been proposed that quickly find local posterior modes. Trading off uncertainty quantification for computational speed, these strategies have enabled spike-and-slab deployments at scales that were previously infeasible. We build on one recent development in this strand of work: the Spike-and-Slab LASSO procedure. Instead of optimization, however, we explore multiple avenues for posterior sampling, some traditional and some new. Intrigued by the speed of Spike-and-Slab LASSO mode detection, we explore the possibility of sampling from an approximate posterior by performing MAP optimization on many independently perturbed datasets. To this end, we explore Bayesian bootstrap ideas and introduce a new class of jittered Spike-and-Slab LASSO priors with random shrinkage targets. These priors are a key constituent of the Bayesian Bootstrap Spike-and-Slab LASSO (BB-SSL) method proposed here. BB-SSL turns fast optimization into approximate posterior sampling. Beyond its scalability, we show that BB-SSL has strong theoretical support. Indeed, we find that the induced pseudo-posteriors contract around the truth at a near-optimal rate in sparse normal-means and in high-dimensional regression. We compare our algorithm to the traditional Stochastic Search Variable Selection (under Laplace priors) as well as many state-of-the-art methods for shrinkage priors. We show, both in simulations and on real data, that our method fares very well in these comparisons, often providing substantial computational gains. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2013-2028 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2022.2025815 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2025815 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2013-2028 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2013242_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Serge Aleshin-Guendel Author-X-Name-First: Serge Author-X-Name-Last: Aleshin-Guendel Author-Name: Mauricio Sadinle Author-X-Name-First: Mauricio Author-X-Name-Last: Sadinle Title: Multifile Partitioning for Record Linkage and Duplicate Detection Abstract: Merging datafiles containing information on overlapping sets of entities is a challenging task in the absence of unique identifiers, and is further complicated when some entities are duplicated in the datafiles. Most approaches to this problem have focused on linking two files assumed to be free of duplicates, or on detecting which records in a single file are duplicates. However, it is common in practice to encounter scenarios that fit somewhere in between or beyond these two settings.
We propose a Bayesian approach for the general setting of multifile record linkage and duplicate detection. We use a novel partition representation to propose a structured prior for partitions that can incorporate prior information about the data collection processes of the datafiles in a flexible manner, and extend previous models for comparison data to accommodate the multifile setting. We also introduce a family of loss functions to derive Bayes estimates of partitions that allow uncertain portions of the partitions to be left unresolved. The performance of our proposed methodology is explored through extensive simulations. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1786-1795 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2013242 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2013242 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1786-1795 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2164287_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Chuan Tian Author-X-Name-First: Chuan Author-X-Name-Last: Tian Author-Name: Duo Jiang Author-X-Name-First: Duo Author-X-Name-Last: Jiang Author-Name: Austin Hammer Author-X-Name-First: Austin Author-X-Name-Last: Hammer Author-Name: Thomas Sharpton Author-X-Name-First: Thomas Author-X-Name-Last: Sharpton Author-Name: Yuan Jiang Author-X-Name-First: Yuan Author-X-Name-Last: Jiang Title: Compositional Graphical Lasso Resolves the Impact of Parasitic Infection on Gut Microbial Interaction Networks in a Zebrafish Model Abstract: Understanding how microbes interact with each other is key to revealing the underlying role that microorganisms play in the host or environment and to identifying microorganisms as an agent that can potentially alter the host or environment. For example, understanding how the microbial interactions associate with parasitic infection can help identify potential drugs or diagnostic tests for parasitic infection. To unravel the microbial interactions, existing tools often rely on graphical models to infer the conditional dependence of microbial abundances to represent their interactions. However, current methods do not simultaneously account for the discreteness, compositionality, and heterogeneity inherent to microbiome data. Thus, we build a new approach called “compositional graphical lasso” upon existing tools by incorporating the above characteristics into the graphical model explicitly. We illustrate the advantage of compositional graphical lasso over current methods under a variety of simulation scenarios and on a benchmark study, the Tara Oceans Project. Moreover, we present our results from the analysis of a dataset from the Zebrafish Parasite Infection Study, which aims to gain insight into how the gut microbiome and parasite burden covary during infection, thus uncovering novel putative methods of disrupting parasite success. Our approach identifies changes in interaction degree between infected and uninfected individuals for three taxa, Photobacterium, Gemmobacter, and Paucibacter, which are inversely predicted by other methods. Further investigation of these method-specific taxa interaction changes reveals their biological plausibility.
In particular, we speculate on the potential pathobiotic roles of Photobacterium and Gemmobacter in the zebrafish gut, and the potential probiotic role of Paucibacter. Collectively, our analyses demonstrate that compositional graphical lasso provides a powerful means of accurately resolving interactions between microbiota and can thus drive novel biological discovery. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1500-1514 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2022.2164287 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2164287 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1500-1514 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2165929_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Ian Laga Author-X-Name-First: Ian Author-X-Name-Last: Laga Author-Name: Le Bao Author-X-Name-First: Le Author-X-Name-Last: Bao Author-Name: Xiaoyue Niu Author-X-Name-First: Xiaoyue Author-X-Name-Last: Niu Title: A Correlated Network Scale-Up Model: Finding the Connection Between Subpopulations Abstract: Aggregated Relational Data (ARD), formed from “How many X’s do you know?” questions, is a powerful tool for learning important network characteristics with incomplete network data. Compared to traditional survey methods, ARD is attractive as it does not require a sample from the target population and does not ask respondents to self-reveal their own status. This is helpful for studying hard-to-reach populations like female sex workers who may be hesitant to reveal their status. From December 2008 to February 2009, the Kiev International Institute of Sociology (KIIS) collected ARD from 10,866 respondents to estimate the size of HIV-related groups in Ukraine. To analyze this data, we propose a new ARD model which incorporates respondent and group covariates in a regression framework and includes a bias term that is correlated between groups. We also introduce a new scaling procedure using the correlation structure to further reduce biases. The resulting size estimates of those most-at-risk of HIV infection can improve the HIV response efficiency in Ukraine. Additionally, the proposed model allows us to better understand two network features without the full network data: (a) What characteristics affect who respondents know, and (b) How is knowing someone from one group related to knowing people from other groups. These features can allow researchers to better recruit marginalized individuals into the prevention and treatment programs. Our proposed model and several existing NSUM models are implemented in the networkscaleup R package. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1515-1524 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2023.2165929 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2165929 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1515-1524 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2039671_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Xiaowu Dai Author-X-Name-First: Xiaowu Author-X-Name-Last: Dai Author-Name: Xiang Lyu Author-X-Name-First: Xiang Author-X-Name-Last: Lyu Author-Name: Lexin Li Author-X-Name-First: Lexin Author-X-Name-Last: Li Title: Kernel Knockoffs Selection for Nonparametric Additive Models Abstract: Thanks to its fine balance between model flexibility and interpretability, the nonparametric additive model has been widely used, and variable selection for this type of model has been frequently studied. However, none of the existing solutions can control the false discovery rate (FDR) unless the sample size tends to infinity. The knockoff framework is a recent proposal that can address this issue, but few knockoff solutions are directly applicable to nonparametric models. In this article, we propose a novel kernel knockoffs selection procedure for the nonparametric additive model. We integrate three key components: the knockoffs, the subsampling for stability, and the random feature mapping for nonparametric function approximation. We show that the proposed method is guaranteed to control the FDR for any sample size, and achieves a power that approaches one as the sample size tends to infinity. We demonstrate the efficacy of our method through intensive simulations and comparisons with the alternative solutions. Our proposal thus makes useful contributions to the methodology of nonparametric variable selection, FDR-based inference, as well as knockoffs. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2158-2170 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2022.2039671 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2039671 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2158-2170 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2223689_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Qingzhao Zhang Author-X-Name-First: Qingzhao Author-X-Name-Last: Zhang Author-Name: Shuangge Ma Author-X-Name-First: Shuangge Author-X-Name-Last: Ma Title: Comment on “A Scale-Free Approach for False Discovery Rate Control in Generalized Linear Models” by Chengguang Dai, Buyu Lin, Xin Xing, and Jun S. Liu Journal: Journal of the American Statistical Association Pages: 1566-1568 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2023.2223689 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2223689 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1566-1568 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2011298_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Xiongtao Dai Author-X-Name-First: Xiongtao Author-X-Name-Last: Dai Author-Name: Sara Lopez-Pintado Author-X-Name-First: Sara Author-X-Name-Last: Lopez-Pintado Title: Tukey’s Depth for Object Data Abstract: We develop a novel exploratory tool for non-Euclidean object data based on data depth, extending the celebrated Tukey’s depth for Euclidean data. The proposed metric halfspace depth, applicable to data objects in a general metric space, assigns to data points depth values that characterize the centrality of these points with respect to the distribution and provides an interpretable center-outward ranking. Desirable theoretical properties that generalize standard depth properties postulated for Euclidean data are established for the metric halfspace depth. The depth median, defined as the deepest point, is shown to have high robustness as a location descriptor both in theory and in simulation. We propose an efficient algorithm to approximate the metric halfspace depth and illustrate its ability to adapt to the intrinsic data geometry. The metric halfspace depth was applied to an Alzheimer’s disease study, revealing group differences in the brain connectivity, modeled as covariance matrices, for subjects in different stages of dementia. Based on phylogenetic trees of seven pathogenic parasites, our proposed metric halfspace depth was also used to construct a meaningful consensus estimate of the evolutionary history and to identify potential outlier trees. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1760-1772 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2011298 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2011298 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1760-1772 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2005608_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Yubai Yuan Author-X-Name-First: Yubai Author-X-Name-Last: Yuan Author-Name: Annie Qu Author-X-Name-First: Annie Author-X-Name-Last: Qu Title: High-Order Joint Embedding for Multi-Level Link Prediction Abstract: Link prediction infers potential links from observed networks, and is one of the essential problems in network analyses. In contrast to traditional graph representation modeling, which only predicts two-way pairwise relations, we propose a novel tensor-based joint network embedding approach on simultaneously encoding pairwise links and hyperlinks onto a latent space, which captures the dependency between pairwise and multi-way links in inferring potential unobserved hyperlinks. The major advantage of the proposed embedding procedure is that it incorporates both the pairwise relationships and subgroup-wise structure among nodes to capture richer network information. In addition, the proposed method introduces a hierarchical dependency among links to infer potential hyperlinks, and leads to better link prediction.
In theory, we establish the estimation consistency for the proposed embedding approach, and provide a faster convergence rate compared to link prediction using pairwise links or hyperlinks only. Numerical studies on both simulation settings and Facebook ego-networks indicate that the proposed method improves both hyperlink and pairwise link prediction accuracy compared to existing link prediction algorithms. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1692-1706 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2005608 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2005608 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1692-1706 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2231056_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: The Editors Title: Bootstrap Prediction Bands for Functional Time Series Journal: Journal of the American Statistical Association Pages: 2211-2211 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2023.2231056 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2231056 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2211-2211 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2019045_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Yujia Deng Author-X-Name-First: Yujia Author-X-Name-Last: Deng Author-Name: Yubai Yuan Author-X-Name-First: Yubai Author-X-Name-Last: Yuan Author-Name: Haoda Fu Author-X-Name-First: Haoda Author-X-Name-Last: Fu Author-Name: Annie Qu Author-X-Name-First: Annie Author-X-Name-Last: Qu Title: Query-Augmented Active Metric Learning Abstract: In this article, we propose an active metric learning method for clustering with pairwise constraints. The proposed method actively queries the label of informative instance pairs, while estimating underlying metrics by incorporating unlabeled instance pairs, which leads to a more accurate and efficient clustering process. In particular, we augment the queried constraints by generating more pairwise labels to provide additional information in learning a metric to enhance clustering performance. Furthermore, we increase the robustness of metric learning by updating the learned metric sequentially and penalizing the irrelevant features adaptively. In addition, we propose a novel active query strategy that evaluates the information gain of instance pairs more accurately by incorporating the neighborhood structure, which improves clustering efficiency without extra labeling cost. In theory, we provide a tighter error bound of the proposed metric learning method using augmented queries compared with methods using existing constraints only. Furthermore, we also investigate the improvement from using the active query strategy instead of random selection. Numerical studies on simulation settings and real datasets indicate that the proposed method is especially advantageous when the signal-to-noise ratio between significant features and irrelevant features is low.
Journal: Journal of the American Statistical Association Pages: 1862-1875 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2019045 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2019045 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1862-1875 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2026778_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Lan Luo Author-X-Name-First: Lan Author-X-Name-Last: Luo Author-Name: Ling Zhou Author-X-Name-First: Ling Author-X-Name-Last: Zhou Author-Name: Peter X.-K. Song Author-X-Name-First: Peter X.-K. Author-X-Name-Last: Song Title: Real-Time Regression Analysis of Streaming Clustered Data With Possible Abnormal Data Batches Abstract: This article develops an incremental learning algorithm based on quadratic inference function (QIF) to analyze streaming datasets with correlated outcomes such as longitudinal data and clustered data. We propose a renewable QIF (RenewQIF) method within a paradigm of renewable estimation and incremental inference, in which parameter estimates are recursively renewed with current data and summary statistics of historical data, but with no use of any historical subject-level raw data. We compare our renewable estimation method with both the offline QIF and offline generalized estimating equations (GEE) approaches that process the entire cumulative subject-level data all together, and show theoretically and numerically that our renewable procedure enjoys statistical and computational efficiency. We also propose an approach to diagnose the homogeneity assumption of regression coefficients via a sequential goodness-of-fit test as a screening procedure on occurrences of abnormal data batches. We implement the proposed methodology by expanding Spark’s existing Lambda architecture for the operation of statistical inference and data quality diagnosis. We illustrate the proposed methodology by extensive simulation studies and an analysis of streaming car crash datasets from the National Automotive Sampling System-Crashworthiness Data System (NASS CDS). Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2029-2044 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2022.2026778 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2026778 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2029-2044 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2020126_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Rui Miao Author-X-Name-First: Rui Author-X-Name-Last: Miao Author-Name: Xiaoke Zhang Author-X-Name-First: Xiaoke Author-X-Name-Last: Zhang Author-Name: Raymond K. W. Wong Author-X-Name-First: Raymond K. W. Author-X-Name-Last: Wong Title: A Wavelet-Based Independence Test for Functional Data With an Application to MEG Functional Connectivity Abstract: Measuring and testing the dependency between multiple random functions is often an important task in functional data analysis. In the literature, a model-based method relies on a model which is subject to the risk of model misspecification, while a model-free method only provides a correlation measure which is inadequate to test independence.
In this paper, we adopt the Hilbert–Schmidt Independence Criterion (HSIC) to measure the dependency between two random functions. We develop a two-step procedure by first pre-smoothing each function based on its discrete and noisy measurements and then applying the HSIC to recovered functions. To ensure the compatibility between the two steps such that the effect of the pre-smoothing error on the subsequent HSIC is asymptotically negligible when the data are densely measured, we propose a new wavelet thresholding method for pre-smoothing and the use of Besov-norm-induced kernels for HSIC. We also provide the corresponding asymptotic analysis. The superior numerical performance of the proposed method over existing ones is demonstrated in a simulation study. Moreover, in a magnetoencephalography (MEG) data application, the functional connectivity patterns identified by the proposed method are more anatomically interpretable than those by existing methods. Journal: Journal of the American Statistical Association Pages: 1876-1889 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2020126 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2020126 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1876-1889 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2183127_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Haoran Xue Author-X-Name-First: Haoran Author-X-Name-Last: Xue Author-Name: Xiaotong Shen Author-X-Name-First: Xiaotong Author-X-Name-Last: Shen Author-Name: Wei Pan Author-X-Name-First: Wei Author-X-Name-Last: Pan Title: Causal Inference in Transcriptome-Wide Association Studies with Invalid Instruments and GWAS Summary Data Abstract: Transcriptome-Wide Association Studies (TWAS) have recently emerged as a popular tool to discover (putative) causal genes by integrating an outcome GWAS dataset with another gene expression/transcriptome GWAS (called eQTL) dataset. In our motivating and target application, we would like to identify causal genes for Low-Density Lipoprotein cholesterol (LDL), which is crucial for developing new treatments for hyperlipidemia and cardiovascular diseases. The statistical principle underlying TWAS is (two-sample) two-stage least squares (2SLS) using multiple correlated SNPs as instrumental variables (IVs); it is closely related to typical (two-sample) Mendelian randomization (MR) using independent SNPs as IVs, which is expected to be impractical and lower-powered for TWAS (and some other) applications. However, often some of the SNPs used may not be valid IVs, for example, due to the widespread pleiotropy of their direct effects on the outcome not mediated through the gene of interest, leading to false conclusions by TWAS (or MR). Building on recent advances in sparse regression, we propose a robust and efficient inferential method to account for both hidden confounding and some invalid IVs via two-stage constrained maximum likelihood (2ScML), an extension of 2SLS. We first develop the proposed method with individual-level data, then extend it both theoretically and computationally to GWAS summary data for the most popular two-sample TWAS design, to which almost all existing robust IV regression methods are, however, not applicable.
We show that the proposed method achieves asymptotically valid statistical inference on causal effects, demonstrating its wider applicability and superior finite-sample performance over the standard 2SLS/TWAS (and MR). We apply the methods to identify putative causal genes for LDL by integrating large-scale lipid GWAS summary data with eQTL data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1525-1537 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2023.2183127 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2183127 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1525-1537 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2165930_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Chenguang Dai Author-X-Name-First: Chenguang Author-X-Name-Last: Dai Author-Name: Buyu Lin Author-X-Name-First: Buyu Author-X-Name-Last: Lin Author-Name: Xin Xing Author-X-Name-First: Xin Author-X-Name-Last: Xing Author-Name: Jun S. Liu Author-X-Name-First: Jun S. Author-X-Name-Last: Liu Title: A Scale-Free Approach for False Discovery Rate Control in Generalized Linear Models Abstract: The Generalized Linear Model (GLM) has been widely used in practice to model counts or other types of non-Gaussian data. This article introduces a framework for feature selection in the GLM that can achieve robust False Discovery Rate (FDR) control. The main idea is to construct a mirror statistic based on data perturbation to measure the importance of each feature. FDR control is achieved by taking advantage of the mirror statistic’s property that its sampling distribution is (asymptotically) symmetric about zero for any null feature. In the moderate-dimensional setting, that is, p/n→κ∈(0,1), we construct the mirror statistic based on the maximum likelihood estimation. In the high-dimensional setting, that is, p≫n, we use the debiased Lasso to build the mirror statistic. The proposed methodology is scale-free as it only hinges on the symmetry of the mirror statistic, and thus can be more robust in finite-sample cases compared to existing methods. Both simulation results and a real data application show that the proposed methods are capable of controlling the FDR and are often more powerful than existing methods including the Benjamini-Hochberg procedure and the knockoff filter. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1551-1565 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2023.2165930 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2165930 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1551-1565 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2003202_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Zhibo Cai Author-X-Name-First: Zhibo Author-X-Name-Last: Cai Author-Name: Yingcun Xia Author-X-Name-First: Yingcun Author-X-Name-Last: Xia Author-Name: Weiqiang Hang Author-X-Name-First: Weiqiang Author-X-Name-Last: Hang Title: An Outer-Product-of-Gradient Approach to Dimension Reduction and its Application to Classification in High Dimensional Space Abstract: Sufficient dimension reduction (SDR) has progressed steadily. However, its ability to improve general function estimation or classification has not been well recognized, especially for high-dimensional data. In this article, we first devise a local linear smoother for high dimensional nonparametric regression and then utilise it in the outer-product-of-gradient (OPG) approach of SDR. We call the method high-dimensional OPG (HOPG). To apply SDR to classification in high-dimensional data, we propose an ensemble classifier that aggregates the results of classifiers built on subspaces reduced consecutively by random projection and HOPG from the data. Asymptotic results for both HOPG and the classifier are established. Superior performance over the existing methods is demonstrated in simulations and real data analyses. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1671-1681 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2003202 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2003202 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1671-1681 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2005609_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Yichen Zhu Author-X-Name-First: Yichen Author-X-Name-Last: Zhu Author-Name: Cheng Li Author-X-Name-First: Cheng Author-X-Name-Last: Li Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: Classification Trees for Imbalanced Data: Surface-to-Volume Regularization Abstract: Classification algorithms face difficulties when one or more classes have limited training data. We are particularly interested in classification trees, due to their interpretability and flexibility. When data are limited in one or more of the classes, the estimated decision boundaries are often irregularly shaped due to the limited sample size, leading to poor generalization error. We propose a novel approach that penalizes the Surface-to-Volume Ratio (SVR) of the decision set, obtaining a new class of SVR-Tree algorithms. We develop a simple and computationally efficient implementation while proving estimation consistency for SVR-Tree and a rate of convergence for an idealized empirical risk minimizer of SVR-Tree. Through real data applications, SVR-Tree is compared with multiple algorithms designed to deal with imbalance. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 1707-1717 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2005609 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2005609 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1707-1717 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2021919_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Anna E. Dudek Author-X-Name-First: Anna E. Author-X-Name-Last: Dudek Author-Name: Łukasz Lenart Author-X-Name-First: Łukasz Author-X-Name-Last: Lenart Title: Spectral Density Estimation for Nonstationary Data With Nonzero Mean Function Abstract: We introduce a new approach for nonparametric spectral density estimation based on the subsampling technique, which we apply to the important class of nonstationary time series. These are almost periodically correlated sequences. In contrast to existing methods, our technique does not require demeaning of the data. On simulated data examples, we compare our estimator of the spectral density function with the classical one. Additionally, we propose a modified estimator, which reduces the leakage effect. Moreover, in the supplementary materials, we provide a simulation study and two real data economic applications. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1900-1910 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2021919 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2021919 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1900-1910 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2223578_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Yin Xia Author-X-Name-First: Yin Author-X-Name-Last: Xia Author-Name: T. Tony Cai Author-X-Name-First: T. Tony Author-X-Name-Last: Cai Title: Discussion of “A Scale-Free Approach for False Discovery Rate Control in Generalized Linear Models” by Dai, Lin, Xing, and Liu Journal: Journal of the American Statistical Association Pages: 1569-1572 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2023.2223578 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2223578 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1569-1572 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2011735_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Xu Guo Author-X-Name-First: Xu Author-X-Name-Last: Guo Author-Name: Haojie Ren Author-X-Name-First: Haojie Author-X-Name-Last: Ren Author-Name: Changliang Zou Author-X-Name-First: Changliang Author-X-Name-Last: Zou Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Title: Threshold Selection in Feature Screening for Error Rate Control Abstract: The hard thresholding rule is commonly adopted in feature screening procedures to screen out unimportant predictors for ultrahigh-dimensional data. However, different thresholds are required to adapt to different contexts of screening problems, and an appropriate thresholding magnitude usually varies with the model and error distribution.
With an ad hoc choice, it is unclear whether all of the important predictors are selected, and it is very likely that the procedures would include many unimportant features. We introduce a data-adaptive threshold selection procedure with error rate control, which is applicable to most kinds of popular screening methods. The key idea is to apply the sample-splitting strategy to construct a series of statistics with the marginal symmetry property and then to utilize the symmetry for obtaining an approximation to the number of false discoveries. We show that the proposed method is able to asymptotically control the false discovery rate and per-family error rate under certain conditions while still retaining all of the important predictors. Three important examples are presented to illustrate the merits of the newly proposed procedures. Numerical experiments indicate that the proposed methodology works well for many existing screening methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1773-1785 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2011735 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2011735 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1773-1785 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2016424_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Xingyu Zhou Author-X-Name-First: Xingyu Author-X-Name-Last: Zhou Author-Name: Yuling Jiao Author-X-Name-First: Yuling Author-X-Name-Last: Jiao Author-Name: Jin Liu Author-X-Name-First: Jin Author-X-Name-Last: Liu Author-Name: Jian Huang Author-X-Name-First: Jian Author-X-Name-Last: Huang Title: A Deep Generative Approach to Conditional Sampling Abstract: We propose a deep generative approach to sampling from a conditional distribution based on a unified formulation of conditional distribution and generalized nonparametric regression function using the noise-outsourcing lemma. The proposed approach aims at learning a conditional generator, so that a random sample from the target conditional distribution can be obtained by transforming a sample drawn from a reference distribution. The conditional generator is estimated nonparametrically with neural networks by matching appropriate joint distributions using the Kullback-Leibler divergence. An appealing aspect of our method is that it allows either or both of the predictor and the response to be high-dimensional and can handle both continuous and discrete predictors and responses. We show that the proposed method is consistent in the sense that the conditional generator converges in distribution to the underlying conditional distribution under mild conditions. Our numerical experiments with simulated and benchmark image data validate the proposed method and demonstrate that it outperforms several existing conditional density estimation methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1837-1848 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2016424 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2016424 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1837-1848 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2004896_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Michael Law Author-X-Name-First: Michael Author-X-Name-Last: Law Author-Name: Ya’acov Ritov Author-X-Name-First: Ya’acov Author-X-Name-Last: Ritov Title: Inference and Estimation for Random Effects in High-Dimensional Linear Mixed Models Abstract: We consider three problems in high-dimensional linear mixed models. Without any assumptions on the design for the fixed effects, we construct asymptotic statistics for testing whether a collection of random effects is zero, derive an asymptotic confidence interval for a single random effect at the parametric rate √n, and propose an empirical Bayes estimator for a part of the mean vector in ANOVA-type models that performs asymptotically as well as the oracle Bayes estimator. We support our theoretical results with numerical simulations and provide comparisons with oracle estimators. The procedures developed are applied to the Trends in International Mathematics and Science Study (TIMSS) data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1682-1691 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2004896 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2004896 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1682-1691 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2044334_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Author-Name: Kai Xu Author-X-Name-First: Kai Author-X-Name-Last: Xu Author-Name: Yeqing Zhou Author-X-Name-First: Yeqing Author-X-Name-Last: Zhou Author-Name: Liping Zhu Author-X-Name-First: Liping Author-X-Name-Last: Zhu Title: Testing the Effects of High-Dimensional Covariates via Aggregating Cumulative Covariances Abstract: In this article, we test for the effects of high-dimensional covariates on the response. In many applications, different components of covariates usually exhibit various levels of variation, which is ubiquitous in high-dimensional data. To simultaneously accommodate such heteroscedasticity and high dimensionality, we propose a novel test based on an aggregation of the marginal cumulative covariances, requiring no prior information on the specific form of regression models. Our proposed test statistic is scale-invariant, tuning-free, and convenient to implement. The asymptotic normality of the proposed statistic is established under the null hypothesis. We further study the asymptotic relative efficiency of our proposed test with respect to the state-of-the-art universal tests in two different settings: one is designed for high-dimensional linear models and the other is introduced in a completely model-free setting. A remarkable finding reveals that, thanks to the scale-invariance property, even under high-dimensional linear models, our proposed test is asymptotically much more powerful than existing competitors for covariates with heterogeneous variances, while maintaining high efficiency for homoscedastic ones. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 2184-2194 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2022.2044334 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2044334 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2184-2194 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2195546_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Yan Liu Author-X-Name-First: Yan Author-X-Name-Last: Liu Author-Name: Dewei Wang Author-X-Name-First: Dewei Author-X-Name-Last: Wang Author-Name: Li Li Author-X-Name-First: Li Author-X-Name-Last: Li Author-Name: Dingsheng Li Author-X-Name-First: Dingsheng Author-X-Name-Last: Li Title: Assessing Disparities in Americans’ Exposure to PCBs and PBDEs based on NHANES Pooled Biomonitoring Data Abstract: The National Health and Nutrition Examination Survey (NHANES) has been continuously biomonitoring Americans’ exposure to two families of harmful environmental chemicals: polychlorinated biphenyls (PCBs) and polybrominated diphenyl ethers (PBDEs). However, biomonitoring these chemicals is expensive. To save cost, in 2005, NHANES resorted to pooled biomonitoring; that is, amalgamating individual specimens to form a pool and measuring chemical levels from pools. Despite being publicly available, these pooled data have seen limited application in health studies. Among the few studies using these data, racial/age disparities were detected, but without controlling for confounding effects. These disadvantages are due to the complexity of pooled measurements and a dearth of statistical tools. Herein, we developed a regression-based method to unzip pooled measurements, which facilitated a comprehensive assessment of disparities in exposure to these chemicals. We found increasing dependence of PCBs on age and income, whereas PBDEs were the highest among adolescents and seniors and were elevated among the low-income population. In addition, Hispanics had the lowest PCBs and PBDEs among all demographic groups after controlling for potential confounders. These findings can guide the development of population-specific interventions to promote environmental justice. Moreover, levels of both chemical families declined throughout the period, indicating the effectiveness of existing regulatory policies. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1538-1550 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2023.2195546 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2195546 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1538-1550 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2003200_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Brian D. Williamson Author-X-Name-First: Brian D. Author-X-Name-Last: Williamson Author-Name: Peter B. Gilbert Author-X-Name-First: Peter B. Author-X-Name-Last: Gilbert Author-Name: Noah R. Simon Author-X-Name-First: Noah R.
Author-X-Name-Last: Simon Author-Name: Marco Carone Author-X-Name-First: Marco Author-X-Name-Last: Carone Title: A General Framework for Inference on Algorithm-Agnostic Variable Importance Abstract: In many applications, it is of interest to assess the relative contribution of features (or subsets of features) toward the goal of predicting a response—in other words, to gauge the variable importance of features. Most recent work on variable importance assessment has focused on describing the importance of features within the confines of a given prediction algorithm. However, such assessment does not necessarily characterize the prediction potential of features, and may provide a misleading reflection of the intrinsic value of these features. To address this limitation, we propose a general framework for nonparametric inference on interpretable algorithm-agnostic variable importance. We define variable importance as a population-level contrast between the oracle predictiveness of all available features and that of all features except those under consideration. We propose a nonparametric efficient estimation procedure that allows the construction of valid confidence intervals, even when machine learning techniques are used. We also outline a valid strategy for testing the null importance hypothesis. Through simulations, we show that our proposal has good operating characteristics, and we illustrate its use with data from a study of an antibody against HIV-1 infection. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1645-1658 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2003200 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2003200 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1645-1658 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2020658_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Amichai Painsky Author-X-Name-First: Amichai Author-X-Name-Last: Painsky Title: Generalized Good-Turing Improves Missing Mass Estimation Abstract: Consider a finite sample from an unknown distribution over a countable alphabet. The missing mass refers to the probability of symbols that do not appear in the sample. Estimating the missing mass is a basic problem in statistics and related fields, which dates back to the early work of Laplace, and the more recent seminal contribution of Good and Turing. In this article, we introduce a generalized Good-Turing (GT) framework for missing mass estimation. We derive an upper bound on the risk (in terms of mean squared error) and minimize it over the parameters of our framework. Our analysis distinguishes between two setups, depending on the (unknown) alphabet size. When the alphabet size is bounded from above, our risk bound demonstrates a significant improvement compared to currently known results (which are typically oblivious to the alphabet size). Based on this bound, we introduce a numerically obtained estimator that improves upon GT. When the alphabet size is unrestricted, we apply our suggested risk bound and introduce a closed-form estimator that again improves GT performance guarantees. Our suggested framework is easy to apply and does not require additional modeling assumptions. This makes it a favorable choice for practical applications.
Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1890-1899 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2020658 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2020658 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1890-1899 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2023551_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Wang Miao Author-X-Name-First: Wang Author-X-Name-Last: Miao Author-Name: Wenjie Hu Author-X-Name-First: Wenjie Author-X-Name-Last: Hu Author-Name: Elizabeth L. Ogburn Author-X-Name-First: Elizabeth L. Author-X-Name-Last: Ogburn Author-Name: Xiao-Hua Zhou Author-X-Name-First: Xiao-Hua Author-X-Name-Last: Zhou Title: Identifying Effects of Multiple Treatments in the Presence of Unmeasured Confounding Abstract: Identification of treatment effects in the presence of unmeasured confounding is a persistent problem in the social, biological, and medical sciences. The problem of unmeasured confounding in settings with multiple treatments is most common in statistical genetics and bioinformatics, where researchers have developed many successful statistical strategies without engaging deeply with the causal aspects of the problem. Recently, there have been a number of attempts to bridge the gap between these statistical approaches and causal inference, but these attempts have either been shown to be flawed or have relied on fully parametric assumptions. In this article, we propose two strategies for identifying and estimating causal effects of multiple treatments in the presence of unmeasured confounding. The auxiliary variables approach leverages variables that are not causally associated with the outcome; in the case of a univariate confounder, our method only requires one auxiliary variable, unlike existing instrumental variable methods that would require as many instruments as there are treatments. An alternative null treatments approach relies on the assumption that at least half of the confounded treatments have no causal effect on the outcome, but does not require a priori knowledge of which treatments are null. Our identification strategies do not impose parametric assumptions on the outcome model and do not rest on estimation of the confounder. This article extends and generalizes existing work on unmeasured confounding with a single treatment and models commonly used in bioinformatics. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1953-1967 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2023551 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2023551 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1953-1967 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2018329_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Ruijia Wu Author-X-Name-First: Ruijia Author-X-Name-Last: Wu Author-Name: Linjun Zhang Author-X-Name-First: Linjun Author-X-Name-Last: Zhang Author-Name: T. Tony Cai Author-X-Name-First: T. Tony
Author-X-Name-Last: Cai Title: Sparse Topic Modeling: Computational Efficiency, Near-Optimal Algorithms, and Statistical Inference Abstract: Sparse topic modeling under the probabilistic latent semantic indexing (pLSI) model is studied. Novel and computationally fast algorithms for estimation and inference of both the word-topic matrix and the topic-document matrix are proposed and their theoretical properties are investigated. Both minimax upper and lower bounds are established and the results show that the proposed algorithms are rate-optimal, up to a logarithmic factor. Moreover, a refitting algorithm is proposed to establish asymptotic normality and construct valid confidence intervals for the individual entries of the word-topic and topic-document matrices. Simulation studies are carried out to investigate the numerical performance of the proposed algorithms. The results show that the proposed algorithms perform well numerically and are more accurate in a range of simulation settings compared to the existing literature. In addition, the methods are illustrated through an analysis of the COVID-19 Open Research Dataset (CORD-19). Journal: Journal of the American Statistical Association Pages: 1849-1861 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2018329 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2018329 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1849-1861 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2006667_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Kuang-Yao Lee Author-X-Name-First: Kuang-Yao Author-X-Name-Last: Lee Author-Name: Lexin Li Author-X-Name-First: Lexin Author-X-Name-Last: Li Author-Name: Bing Li Author-X-Name-First: Bing Author-X-Name-Last: Li Author-Name: Hongyu Zhao Author-X-Name-First: Hongyu Author-X-Name-Last: Zhao Title: Nonparametric Functional Graphical Modeling Through Functional Additive Regression Operator Abstract: In this article, we develop a nonparametric graphical model for multivariate random functions. Most existing graphical models are restricted by the assumptions of multivariate Gaussian or copula Gaussian distributions, which also imply linear relations among the random variables or functions on different nodes. We relax those assumptions by building our graphical model based on a new statistical object—the functional additive regression operator. By carrying out regression and neighborhood selection at the operator level, our method can capture nonlinear relations without requiring any distributional assumptions. Moreover, the method is built using only one-dimensional kernels, thus avoiding the curse of dimensionality from which a fully nonparametric approach often suffers, and enabling us to work with large-scale networks. We derive error bounds for the estimated regression operator and establish graph estimation consistency, while allowing the number of functions to diverge at the exponential rate of the sample size. We demonstrate the efficacy of our method by both simulations and analysis of an electroencephalography dataset. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 1718-1732 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2006667 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2006667 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1718-1732 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2231063_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Michael Law Author-X-Name-First: Michael Author-X-Name-Last: Law Author-Name: Peter Bühlmann Author-X-Name-First: Peter Author-X-Name-Last: Bühlmann Title: Discussion of “A Scale-Free Approach for False Discovery Rate Control in Generalized Linear Models” Journal: Journal of the American Statistical Association Pages: 1578-1583 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2023.2231063 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2231063 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1578-1583 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2029456_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Ganggang Xu Author-X-Name-First: Ganggang Author-X-Name-Last: Xu Author-Name: Chen Liang Author-X-Name-First: Chen Author-X-Name-Last: Liang Author-Name: Rasmus Waagepetersen Author-X-Name-First: Rasmus Author-X-Name-Last: Waagepetersen Author-Name: Yongtao Guan Author-X-Name-First: Yongtao Author-X-Name-Last: Guan Title: Semiparametric Goodness-of-Fit Test for Clustered Point Processes with a Shape-Constrained Pair Correlation Function Abstract: Specification of a parametric model for the intensity function is a fundamental task in statistics for spatial point processes. It is, therefore, crucial to be able to assess the appropriateness of a suggested model for a given point pattern dataset. For this purpose, we develop a new class of semiparametric goodness-of-fit tests for the specified parametric first-order intensity, without assuming a full data-generating mechanism, which is needed for the existing popular Monte Carlo tests. The proposed tests crucially rely on accurate nonparametric estimation of the second-order properties of a point process. To address this, we propose a new nonparametric pair correlation function (PCF) estimator for clustered spatial point processes under some mild shape constraints, which is shown to achieve uniform consistency. The proposed test statistics are computationally efficient owing to closed-form asymptotic distributions and achieve the nominal size even for testing composite hypotheses. In practice, the proposed estimation and testing procedures provide effective tools to improve parametric intensity function modeling, which is demonstrated through extensive simulation studies as well as a real data analysis of street crime activity in Washington, DC. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2072-2087 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2022.2029456 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2029456 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2072-2087 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2035736_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Jiashun Jin Author-X-Name-First: Jiashun Author-X-Name-Last: Jin Author-Name: Zheng Tracy Ke Author-X-Name-First: Zheng Tracy Author-X-Name-Last: Ke Author-Name: Shengming Luo Author-X-Name-First: Shengming Author-X-Name-Last: Luo Author-Name: Minzhe Wang Author-X-Name-First: Minzhe Author-X-Name-Last: Wang Title: Optimal Estimation of the Number of Network Communities Abstract: In network analysis, how to estimate the number of communities K is a fundamental problem. We consider a broad setting where we allow severe degree heterogeneity and a wide range of sparsity levels, and propose Stepwise Goodness of Fit (StGoF) as a new approach. This is a stepwise algorithm, where for m = 1, 2, …, we alternately use a community detection step and a goodness of fit (GoF) step. We adapt the SCORE method of Jin for community detection, and propose a new GoF metric. We show that at step m, the GoF metric diverges to ∞ in probability for all m < K and converges to N(0, 1) if m = K. This gives rise to a consistent estimate for K. Also, we discover the right way to define the signal-to-noise ratio (SNR) for our problem and show that consistent estimates for K do not exist if SNR→0, and StGoF is uniformly consistent for K if SNR→∞. Therefore, StGoF achieves the optimal phase transition. Similar stepwise methods are known to face analytical challenges. We overcome the challenges by using a different stepwise scheme in StGoF and by deriving sharp results that were not available before. The key to our analysis is to show that SCORE has the Nonsplitting Property (NSP). Primarily due to a nontractable rotation of eigenvectors dictated by the Davis–Kahan sin(θ) theorem, the NSP is nontrivial to prove and requires new techniques we develop. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2101-2116 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2022.2035736 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2035736 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2101-2116 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2224412_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Sai Li Author-X-Name-First: Sai Author-X-Name-Last: Li Author-Name: Yisha Yao Author-X-Name-First: Yisha Author-X-Name-Last: Yao Author-Name: Cun-Hui Zhang Author-X-Name-First: Cun-Hui Author-X-Name-Last: Zhang Title: Comments on “A Scale-Free Approach for False Discovery Rate Control in Generalized Linear Models” Journal: Journal of the American Statistical Association Pages: 1586-1589 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2023.2224412 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2224412 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1586-1589 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2024437_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Lexin Li Author-X-Name-First: Lexin Author-X-Name-Last: Li Author-Name: Jing Zeng Author-X-Name-First: Jing Author-X-Name-Last: Zeng Author-Name: Xin Zhang Author-X-Name-First: Xin Author-X-Name-Last: Zhang Title: Generalized Liquid Association Analysis for Multimodal Data Integration Abstract: Multimodal data are now prevalent in scientific research. One of the central questions in multimodal integrative analysis is to understand how two data modalities associate and interact with each other given another modality or demographic variables. The problem can be formulated as studying the associations among three sets of random variables, a question that has received relatively little attention in the literature. In this article, we propose a novel generalized liquid association analysis method, which offers a new and unique angle to this important class of problems of studying three-way associations. We extend the notion of liquid association from the univariate setting to the sparse, multivariate, and high-dimensional setting. We establish a population dimension reduction model, transform the problem into a sparse Tucker decomposition of a three-way tensor, and develop a higher-order orthogonal iteration algorithm for parameter estimation. We derive the nonasymptotic error bound and asymptotic consistency of the proposed estimator, while allowing the variable dimensions to be larger than and diverge with the sample size. We demonstrate the efficacy of the method through both simulations and a multimodal neuroimaging application for Alzheimer’s disease research. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1984-1996 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2024437 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2024437 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1984-1996 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2183129_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: David Rios Insua Author-X-Name-First: David Author-X-Name-Last: Rios Insua Author-Name: Roi Naveiro Author-X-Name-First: Roi Author-X-Name-Last: Naveiro Author-Name: Víctor Gallego Author-X-Name-First: Víctor Author-X-Name-Last: Gallego Author-Name: Jason Poulos Author-X-Name-First: Jason Author-X-Name-Last: Poulos Title: Adversarial Machine Learning: Bayesian Perspectives Abstract: Adversarial Machine Learning (AML) is emerging as a major field aimed at protecting Machine Learning (ML) systems against security threats: in certain scenarios there may be adversaries that actively manipulate input data to fool learning systems. This creates a new class of security vulnerabilities that ML systems may face, and a new desirable property, called adversarial robustness, that is essential for trusting operations based on ML outputs. Most work in AML is built upon a game-theoretic modeling of the conflict between a learning system and an adversary ready to manipulate input data. This assumes that each agent knows their opponent’s interests and uncertainty judgments, facilitating inferences based on Nash equilibria.
However, such a common-knowledge assumption is not realistic in the security scenarios typical of AML. After reviewing such game-theoretic approaches, we discuss the benefits that Bayesian perspectives provide when defending ML-based systems. We demonstrate how the Bayesian approach allows us to explicitly model our uncertainty about the opponent’s beliefs and interests, relaxing unrealistic assumptions and providing more robust inferences. We illustrate this approach in supervised learning settings, and identify relevant future research problems. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2195-2206 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2023.2183129 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2183129 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2195-2206 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2002157_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Yaoming Zhen Author-X-Name-First: Yaoming Author-X-Name-Last: Zhen Author-Name: Junhui Wang Author-X-Name-First: Junhui Author-X-Name-Last: Wang Title: Community Detection in General Hypergraph Via Graph Embedding Abstract: Conventional network data analysis has largely focused on pairwise interactions between two entities, yet multi-way interactions among multiple entities have been frequently observed in real-life hypergraph networks. In this article, we propose a novel method for detecting community structure in general hypergraph networks, uniform or non-uniform. The proposed method introduces a null vertex to augment a nonuniform hypergraph into a uniform multi-hypergraph, and then embeds the multi-hypergraph in a low-dimensional vector space such that vertices within the same community are close to each other. The resultant optimization task can be efficiently tackled by an alternating updating scheme. The asymptotic consistencies of the proposed method are established in terms of both community detection and hypergraph estimation, which are also supported by numerical experiments on some synthetic and real-life hypergraph networks. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1620-1629 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2002157 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2002157 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1620-1629 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2008402_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Ethan X. Fang Author-X-Name-First: Ethan X. Author-X-Name-Last: Fang Author-Name: Zhaoran Wang Author-X-Name-First: Zhaoran Author-X-Name-Last: Wang Author-Name: Lan Wang Author-X-Name-First: Lan Author-X-Name-Last: Wang Title: Fairness-Oriented Learning for Optimal Individualized Treatment Rules Abstract: There has recently been a surge in methodological development for optimal individualized treatment rule (ITR) estimation. The standard methods in the literature are designed to maximize the potential average performance (assuming larger outcomes are desirable).
A notable drawback of the standard approach, due to heterogeneity in treatment response, is that the estimated optimal ITR may be suboptimal or even detrimental to certain disadvantaged subpopulations. Motivated by the importance of incorporating an appropriate fairness constraint in optimal decision making (e.g., assign treatment with protection to those with shorter survival time, or assign a job training program with protection to those with lower wages), we propose a new framework that aims to estimate an optimal ITR to maximize the average value with the guarantee that its tail performance exceeds a prespecified threshold. The optimal fairness-oriented ITR corresponds to a solution of a nonconvex optimization problem. To handle the computational challenge, we develop a new efficient first-order algorithm. We establish theoretical guarantees for the proposed estimator. Furthermore, we extend the proposed method to dynamic optimal ITRs. The advantages of the proposed approach over existing methods are demonstrated via extensive numerical studies and real data analysis. Journal: Journal of the American Statistical Association Pages: 1733-1746 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2008402 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2008402 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1733-1746 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2231224_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Martin Holub Author-X-Name-First: Martin Author-X-Name-Last: Holub Author-Name: Patrícia Martinková Author-X-Name-First: Patrícia Author-X-Name-Last: Martinková Title: Supervised Machine Learning for Text Analysis in R Journal: Journal of the American Statistical Association Pages: 2207-2209 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2023.2231224 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2231224 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2207-2209 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2016423_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Shunan Yao Author-X-Name-First: Shunan Author-X-Name-Last: Yao Author-Name: Bradley Rava Author-X-Name-First: Bradley Author-X-Name-Last: Rava Author-Name: Xin Tong Author-X-Name-First: Xin Author-X-Name-Last: Tong Author-Name: Gareth James Author-X-Name-First: Gareth Author-X-Name-Last: James Title: Asymmetric Error Control Under Imperfect Supervision: A Label-Noise-Adjusted Neyman–Pearson Umbrella Algorithm Abstract: Label noise in data has long been an important problem in supervised learning applications as it affects the effectiveness of many widely used classification methods. Recently, important real-world applications, such as medical diagnosis and cybersecurity, have generated renewed interest in the Neyman–Pearson (NP) classification paradigm, which constrains the more severe type of error (e.g., the Type I error) under a preferred level while minimizing the other (e.g., the Type II error). However, there has been little research on the NP paradigm under label noise. 
It is somewhat surprising that even when common NP classifiers ignore the label noise in the training stage, they are still able to control the Type I error with high probability. However, the price they pay is excessive conservativeness of the Type I error and hence a significant drop in power (i.e., 1 - Type II error). Assuming that domain experts provide lower bounds on the corruption severity, we propose the first theory-backed algorithm that adapts most state-of-the-art classification methods to the training label noise under the NP paradigm. The resulting classifiers not only control the Type I error with high probability under the desired level but also improve power. Journal: Journal of the American Statistical Association Pages: 1824-1836 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2016423 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2016423 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1824-1836 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2034632_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Jingfei Zhang Author-X-Name-First: Jingfei Author-X-Name-Last: Zhang Author-Name: Yi Li Author-X-Name-First: Yi Author-X-Name-Last: Li Title: High-Dimensional Gaussian Graphical Regression Models with Covariates Abstract: Though Gaussian graphical models have been widely used in many scientific fields, relatively limited progress has been made in linking graph structures to external covariates. We propose a Gaussian graphical regression model, which regresses both the mean and the precision matrix of a Gaussian graphical model on covariates. In the context of co-expression quantitative trait locus (QTL) studies, our method can determine how genetic variants and clinical conditions modulate the subject-level network structures, and recover both the population-level and subject-level gene networks. Our framework encourages sparsity of covariate effects on both the mean and the precision matrix. In particular, for the precision matrix we stipulate simultaneous sparsity, that is, group sparsity on effective covariates and element-wise sparsity on their effects on network edges. We establish variable selection consistency first in the case with known mean parameters and then in the more challenging case with unknown means depending on external covariates, and in both cases establish the ℓ2 convergence rates and the selection consistency of the estimated precision parameters. The utility and efficacy of our proposed method are demonstrated through simulation studies and an application to a co-expression QTL study with brain cancer patients. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2088-2100 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2022.2034632 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2034632 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2088-2100 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2016422_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Janice L. Scealy Author-X-Name-First: Janice L. Author-X-Name-Last: Scealy Author-Name: Andrew T. A. Wood Author-X-Name-First: Andrew T. A.
Author-X-Name-Last: Wood Title: Score Matching for Compositional Distributions Abstract: Compositional data are challenging to analyse due to the non-negativity and sum-to-one constraints on the sample space. With real data, it is often the case that many of the compositional components are highly right-skewed, with large numbers of zeros. Major limitations of currently available models for compositional data include one or more of the following: insufficient flexibility in terms of distributional shape; difficulty in accommodating zeros in the data in estimation; and lack of computational viability in moderate to high dimensions. In this article, we propose a new model, the polynomially tilted pairwise interaction (PPI) model, for analysing compositional data. Maximum likelihood estimation is difficult for the PPI model. Instead, we propose novel score matching estimators; this entails extending the score matching approach to Riemannian manifolds with boundary. These new estimators are available in closed form and simulation studies show that they perform well in practice. As our main application, we analyse real microbiome count data with fixed totals using a multinomial latent variable model with a PPI model for the latent variable distribution. We prove that, under certain conditions, the new score matching estimators are consistent for the parameters in the new multinomial latent variable model. Journal: Journal of the American Statistical Association Pages: 1811-1823 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2016422 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2016422 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1811-1823 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2002156_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Decai Liang Author-X-Name-First: Decai Author-X-Name-Last: Liang Author-Name: Hui Huang Author-X-Name-First: Hui Author-X-Name-Last: Huang Author-Name: Yongtao Guan Author-X-Name-First: Yongtao Author-X-Name-Last: Guan Author-Name: Fang Yao Author-X-Name-First: Fang Author-X-Name-Last: Yao Title: Test of Weak Separability for Spatially Stationary Functional Field Abstract: For spatially dependent functional data, a generalized Karhunen-Loève expansion is commonly used to decompose data into an additive form of temporal components and spatially correlated coefficients. This structure provides a convenient model to investigate the space-time interactions, but may not hold for complex spatio-temporal processes. In this work, we introduce the concept of weak separability, and propose a formal test to examine its validity for a non-replicated spatially stationary functional field. The asymptotic distribution of the test statistic, which adapts to potentially diverging ranks, is derived by constructing lag covariance estimators that are easy to compute for practical implementation. We demonstrate the efficacy of the proposed test via simulations and illustrate its usefulness in two real examples: China PM 2.5 data and Harvard Forest data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association Pages: 1606-1619 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2002156 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2002156 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1606-1619 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2038180_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Zhengling Qi Author-X-Name-First: Zhengling Author-X-Name-Last: Qi Author-Name: Jong-Shi Pang Author-X-Name-First: Jong-Shi Author-X-Name-Last: Pang Author-Name: Yufeng Liu Author-X-Name-First: Yufeng Author-X-Name-Last: Liu Title: On Robustness of Individualized Decision Rules Abstract: With the emergence of precision medicine, estimating optimal individualized decision rules (IDRs) has attracted tremendous attention in many scientific areas. Most existing literature has focused on finding optimal IDRs that can maximize the expected outcome for each individual. Motivated by complex individualized decision making procedures and the popular conditional value at risk (CVaR) measure, we propose a new robust criterion to estimate optimal IDRs in order to control the average lower tail of the individuals’ outcomes. In addition to improving the individualized expected outcome, our proposed criterion takes risks into consideration, and thus the resulting IDRs can prevent adverse events. The optimal IDR under our criterion can be interpreted as the decision rule that maximizes the “worst-case” scenario of the individualized outcome when the underlying distribution is perturbed within a constrained set. An efficient non-convex optimization algorithm is proposed with convergence guarantees. We investigate theoretical properties for our estimated optimal IDRs under the proposed criterion such as consistency and finite sample error bounds. Simulation studies and a real data application are used to further demonstrate the robust performance of our methods. Several extensions of the proposed method are also discussed. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2143-2157 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2022.2038180 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2038180 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2143-2157 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2037431_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Yinpu Li Author-X-Name-First: Yinpu Author-X-Name-Last: Li Author-Name: Antonio R. Linero Author-X-Name-First: Antonio R. Author-X-Name-Last: Linero Author-Name: Jared Murray Author-X-Name-First: Jared Author-X-Name-Last: Murray Title: Adaptive Conditional Distribution Estimation with Bayesian Decision Tree Ensembles Abstract: We present a Bayesian nonparametric model for conditional distribution estimation using Bayesian additive regression trees (BART). The generative model we use is based on rejection sampling from a base model. Like other BART models, our model is flexible, has a default prior specification, and is computationally convenient. 
To address the distinguished role of the response in our BART model, we introduce an approach to targeted smoothing of BART models, which is of independent interest. We study the proposed model theoretically and provide sufficient conditions for the posterior distribution to concentrate at close to the minimax-optimal rate adaptively over smoothness classes in the high-dimensional regime in which many predictors are irrelevant. To fit our model, we propose a data augmentation algorithm that allows existing BART samplers to be extended with minimal effort. We illustrate the performance of our methodology on simulated data and use it to study the relationship between education and body mass index using data from the Medical Expenditure Panel Survey (MEPS). Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2129-2142 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2022.2037431 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2037431 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2129-2142 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2037430_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Federico Castelletti Author-X-Name-First: Federico Author-X-Name-Last: Castelletti Author-Name: Stefano Peluso Author-X-Name-First: Stefano Author-X-Name-Last: Peluso Title: Network Structure Learning Under Uncertain Interventions Abstract: Gaussian Directed Acyclic Graphs (DAGs) represent a powerful tool for learning the network of dependencies among variables, a task which is of primary interest in many fields and specifically in biology. Different DAGs may encode equivalent conditional independence structures, implying limited ability, with observational data, to identify causal relations. In many contexts, however, measurements are collected under heterogeneous settings where variables are subject to exogenous interventions. Interventional data can improve the structure learning process whenever the targets of an intervention are known. However, these are often uncertain or completely unknown, as in the context of drug target discovery. We propose a Bayesian method for learning dependence structures and intervention targets from data subject to interventions on unknown variables of the system. Selected features of our approach include a DAG-Wishart prior on the DAG parameters and the use of variable selection priors to express uncertainty about the targets. We provide theoretical results on the correct asymptotic identification of intervention targets and derive sufficient conditions for Bayes factor and posterior ratio consistency of the graph structure. Our method is applied in simulations and real-world data settings to analyze perturbed protein data and assess antiepileptic drug therapies. Details of the MCMC algorithm and proofs of propositions are provided in the supplementary materials, together with more extensive results on simulations and applied studies. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2117-2128 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2022.2037430 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2037430 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2117-2128 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2023552_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Tianhao Wang Author-X-Name-First: Tianhao Author-X-Name-Last: Wang Author-Name: Sarah J. Ratcliffe Author-X-Name-First: Sarah J. Author-X-Name-Last: Ratcliffe Author-Name: Wensheng Guo Author-X-Name-First: Wensheng Author-X-Name-Last: Guo Title: Time-to-Event Analysis with Unknown Time Origins via Longitudinal Biomarker Registration Abstract: In observational studies, the time origin of interest for time-to-event analysis is often unknown, such as the time of disease onset. Existing approaches to estimating the time origins are commonly built on extrapolating a parametric longitudinal model and rely on rigid assumptions that can lead to biased inferences. In this paper, we introduce a flexible semiparametric curve registration model. It assumes the longitudinal trajectories follow a flexible common shape function with a person-specific disease progression pattern characterized by a random curve registration function, which is further used to model the unknown time origin as a random start time. This random time is used as a link to jointly model the longitudinal and survival data, where the unknown time origins are integrated out in the joint likelihood function, which facilitates unbiased and consistent estimation. Since the disease progression pattern naturally predicts time-to-event, we further propose a new functional survival model using the registration function as a predictor of the time-to-event. The asymptotic consistency and semiparametric efficiency of the proposed models are proved. Simulation studies and two real data applications demonstrate the effectiveness of this new approach. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1968-1983 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2023552 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2023552 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1968-1983 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2002158_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Ying Yang Author-X-Name-First: Ying Author-X-Name-Last: Yang Author-Name: Fang Yao Author-X-Name-First: Fang Author-X-Name-Last: Yao Title: Online Estimation for Functional Data Abstract: Functional data analysis has attracted considerable interest and is facing new challenges, one of which is the increasing availability of data arriving in a streaming manner. In this article we develop an online nonparametric method to dynamically update the estimates of mean and covariance functions for functional data. The kernel-type estimates can be decomposed into two sufficient statistics depending on the data-driven bandwidths. We propose to approximate the future optimal bandwidths by a sequence of dynamically changing candidates and combine the corresponding statistics across blocks to form the updated estimation. The proposed online method is easy to compute based on the stored sufficient statistics and the current data block. We derive the asymptotic normality and, more importantly, the relative efficiency lower bounds of the online estimates of mean and covariance functions.
This provides insight into the relationship between estimation accuracy and computational cost driven by the length of the candidate bandwidth sequence. Simulations and real data examples are provided to support such findings. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1630-1644 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2002158 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2002158 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1630-1644 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2027775_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Pulong Ma Author-X-Name-First: Pulong Author-X-Name-Last: Ma Author-Name: Anindya Bhadra Author-X-Name-First: Anindya Author-X-Name-Last: Bhadra Title: Beyond Matérn: On A Class of Interpretable Confluent Hypergeometric Covariance Functions Abstract: The Matérn covariance function is a popular choice for prediction in the spatial statistics and uncertainty quantification literature. A key benefit of the Matérn class is that it is possible to get precise control over the degree of mean-square differentiability of the random process. However, the Matérn class possesses exponentially decaying tails and thus may not be suitable for modeling polynomially decaying dependence. This problem can be remedied using polynomial covariances; however, one loses control over the degree of mean-square differentiability of the corresponding processes, in that random processes with existing polynomial covariances are either infinitely mean-square differentiable or nowhere mean-square differentiable. We construct a new family of covariance functions called the Confluent Hypergeometric (CH) class using a scale mixture representation of the Matérn class, where one obtains the benefits of both Matérn and polynomial covariances. The resultant covariance contains two parameters: one controls the degree of mean-square differentiability near the origin and the other controls the tail heaviness, independently of each other. Using a spectral representation, we derive theoretical properties of this new covariance including equivalent measures and asymptotic behavior of the maximum likelihood estimators under infill asymptotics. The improved theoretical properties of the CH class are verified via extensive simulations. Application using NASA’s Orbiting Carbon Observatory-2 satellite data confirms the advantage of the CH class over the Matérn class, especially in extrapolative settings. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2045-2058 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2022.2027775 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2027775 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
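For readers who want the contrast in the Ma and Bhadra abstract above made concrete, the following minimal Python sketch evaluates the standard Matérn covariance in its textbook parameterization (smoothness nu, range rho, variance sigma2); it illustrates the class whose exponentially decaying tails motivate the CH construction, and is not an implementation of the authors' CH covariance.

import numpy as np
from scipy.special import gamma, kv  # kv: modified Bessel function of the second kind

def matern(h, sigma2=1.0, rho=1.0, nu=1.5):
    # C(h) = sigma2 * 2^(1-nu)/Gamma(nu) * (sqrt(2 nu) h / rho)^nu * K_nu(sqrt(2 nu) h / rho)
    h = np.asarray(h, dtype=float)
    scaled = np.sqrt(2.0 * nu) * h / rho
    out = np.full_like(h, sigma2)        # C(0) = sigma2 by convention
    nz = scaled > 0
    out[nz] = sigma2 * (2.0 ** (1.0 - nu) / gamma(nu)) * scaled[nz] ** nu * kv(nu, scaled[nz])
    return out

print(matern(np.array([0.0, 0.5, 1.0, 2.0]), nu=1.5))  # covariance decays exponentially in h

Larger nu yields a smoother (more mean-square differentiable) process; the tail behavior, however, is fixed, which is exactly the limitation the CH class removes.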
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2045-2058 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2021920_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Gonzalo Vazquez-Bare Author-X-Name-First: Gonzalo Author-X-Name-Last: Vazquez-Bare Title: Causal Spillover Effects Using Instrumental Variables Abstract: I set up a potential outcomes framework to analyze spillover effects using instrumental variables. I characterize the population compliance types in a setting in which spillovers can occur on both treatment take-up and outcomes, and provide conditions for identification of the marginal distribution of compliance types. I show that intention-to-treat (ITT) parameters aggregate multiple direct and spillover effects for different compliance types, and hence do not have a clear link to causally interpretable parameters. Moreover, rescaling ITT parameters by first-stage estimands generally recovers a weighted combination of average effects where the sum of weights is larger than one. I then analyze identification of causal direct and spillover effects under one-sided noncompliance, and show that causal effects can be estimated by 2SLS in this case. I illustrate the proposed methods using data from an experiment on social interactions and voting behavior. I also introduce an alternative assumption, independence of the peers’ types, that identifies parameters of interest under two-sided noncompliance by restricting the amount of heterogeneity in average potential outcomes. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1911-1922 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2021920 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2021920 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1911-1922 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2245686_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Chenguang Dai Author-X-Name-First: Chenguang Author-X-Name-Last: Dai Author-Name: Buyu Lin Author-X-Name-First: Buyu Author-X-Name-Last: Lin Author-Name: Xin Xing Author-X-Name-First: Xin Author-X-Name-Last: Xing Author-Name: Jun S. Liu Author-X-Name-First: Jun S. Author-X-Name-Last: Liu Title: Rejoinder: A Scale-free Approach for False Discovery Rate Control in Generalized Linear Models Journal: Journal of the American Statistical Association Pages: 1590-1594 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2023.2245686 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2245686 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
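The "rescaling ITT parameters by first-stage estimands" in the Vazquez-Bare abstract above is, in the classical no-spillover case, just the Wald/2SLS estimator. A minimal simulated Python sketch of that baseline follows (the data-generating process and all names are illustrative, not taken from the paper):

import numpy as np

rng = np.random.default_rng(4)
n = 5000
z = rng.integers(0, 2, n)                            # randomized binary instrument
d = (rng.random(n) < 0.2 + 0.6 * z).astype(float)    # take-up raised by the instrument
y = 1.0 * d + rng.standard_normal(n)                 # true treatment effect of 1.0

itt = y[z == 1].mean() - y[z == 0].mean()            # intention-to-treat contrast
first_stage = d[z == 1].mean() - d[z == 0].mean()    # effect of z on take-up
print(itt / first_stage)                             # Wald ratio, close to 1.0 here

Under spillovers, the abstract's point is that this ratio is instead a weighted combination of direct and spillover effects whose weights can sum to more than one, so it loses this clean interpretation.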
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1590-1594 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2157727_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Xinzhou Guo Author-X-Name-First: Xinzhou Author-X-Name-Last: Guo Author-Name: Waverly Wei Author-X-Name-First: Waverly Author-X-Name-Last: Wei Author-Name: Molei Liu Author-X-Name-First: Molei Author-X-Name-Last: Liu Author-Name: Tianxi Cai Author-X-Name-First: Tianxi Author-X-Name-Last: Cai Author-Name: Chong Wu Author-X-Name-First: Chong Author-X-Name-Last: Wu Author-Name: Jingshen Wang Author-X-Name-First: Jingshen Author-X-Name-Last: Wang Title: Assessing the Most Vulnerable Subgroup to Type II Diabetes Associated with Statin Usage: Evidence from Electronic Health Record Data Abstract: There have been increased concerns that the use of statins, one of the most commonly prescribed drugs for treating coronary artery disease, is potentially associated with the increased risk of new-onset Type II diabetes (T2D). Nevertheless, to date, there is no robust evidence as to whether and what kinds of populations are indeed vulnerable to developing T2D after taking statins. In this case study, leveraging the biobank and electronic health record data in the Partner Health System, we introduce a new data analysis pipeline and a novel statistical methodology that address existing limitations by (i) designing a rigorous causal framework that systematically examines the causal effects of statin usage on T2D risk in observational data, (ii) uncovering which patient subgroup is most vulnerable to developing T2D after taking statins, and (iii) assessing the replicability and statistical significance of the most vulnerable subgroup via a bootstrap calibration procedure. Our proposed approach delivers asymptotically sharp confidence intervals and a debiased estimate for the treatment effect of the most vulnerable subgroup in the presence of high-dimensional covariates. With our proposed approach, we find that females with high T2D genetic risk are at the highest risk of developing T2D due to statin usage. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1488-1499 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2022.2157727 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2157727 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1488-1499 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2223656_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Gerda Claeskens Author-X-Name-First: Gerda Author-X-Name-Last: Claeskens Author-Name: Maarten Jansen Author-X-Name-First: Maarten Author-X-Name-Last: Jansen Author-Name: Jing Zhou Author-X-Name-First: Jing Author-X-Name-Last: Zhou Title: Discussion on: “A Scale-Free Approach for False Discovery Rate Control in Generalized Linear Models” by Dai, Lin, Xing, and Liu Journal: Journal of the American Statistical Association Pages: 1573-1577 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2023.2223656 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2223656 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1573-1577 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2000868_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Harold D. Chiang Author-X-Name-First: Harold D. Author-X-Name-Last: Chiang Author-Name: Kengo Kato Author-X-Name-First: Kengo Author-X-Name-Last: Kato Author-Name: Yuya Sasaki Author-X-Name-First: Yuya Author-X-Name-Last: Sasaki Title: Inference for High-Dimensional Exchangeable Arrays Abstract: We consider inference for high-dimensional separately and jointly exchangeable arrays where the dimensions may be much larger than the sample sizes. For both exchangeable arrays, we first derive high-dimensional central limit theorems over the rectangles and subsequently develop novel multiplier bootstraps with theoretical guarantees. These theoretical results rely on new technical tools such as a Hoeffding-type decomposition and maximal inequalities for the degenerate components in the Hoeffding-type decomposition for the exchangeable arrays. We exhibit applications of our methods to uniform confidence bands for density estimation under joint exchangeability and penalty choice for l1-penalized regression under separate exchangeability. Extensive simulations demonstrate precise uniform coverage rates. We illustrate our methods by constructing uniform confidence bands for international trade network densities. Journal: Journal of the American Statistical Association Pages: 1595-1605 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2000868 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2000868 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1595-1605 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2232834_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Lucas Janson Author-X-Name-First: Lucas Author-X-Name-Last: Janson Title: Discussion of “A Scale-Free Approach for False Discovery Rate Control in Generalized Linear Models” by Chenguang Dai, Buyu Lin, Xin Xing, and Jun S. Liu Journal: Journal of the American Statistical Association Pages: 1584-1585 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2023.2232834 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2232834 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1584-1585 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2156349_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Yize Zhao Author-X-Name-First: Yize Author-X-Name-Last: Zhao Author-Name: Changgee Chang Author-X-Name-First: Changgee Author-X-Name-Last: Chang Author-Name: Jingwen Zhang Author-X-Name-First: Jingwen Author-X-Name-Last: Zhang Author-Name: Zhengwu Zhang Author-X-Name-First: Zhengwu Author-X-Name-Last: Zhang Title: Genetic Underpinnings of Brain Structural Connectome for Young Adults Abstract: With distinct advantages in power over behavioral phenotypes, brain imaging traits have become emerging endophenotypes to dissect molecular contributions to behaviors and neuropsychiatric illnesses.
Among different imaging features, brain structural connectivity (i.e., the structural connectome), which summarizes the anatomical connections between different brain regions, is one of the most cutting-edge yet under-investigated traits, and the genetic influence on structural connectome variation remains highly elusive. Relying on a landmark imaging genetics study for young adults, we develop a biologically plausible brain network response shrinkage model to comprehensively characterize the relationship between high dimensional genetic variants and the structural connectome phenotype. Under a unified Bayesian framework, we accommodate the topology of the brain network and the biological architecture within the genome, and eventually establish a mechanistic mapping between genetic biomarkers and the associated brain sub-network units. An efficient expectation-maximization algorithm is developed to estimate the model and ensure computing feasibility. In the application to the Human Connectome Project Young Adult (HCP-YA) data, we establish the genetic underpinnings, which are highly interpretable under functional annotation and brain-tissue eQTL analysis, for the brain white matter tracts connecting the hippocampus and the two cerebral hemispheres. We also show the superiority of our method in extensive simulations. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1473-1487 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2022.2156349 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2156349 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1473-1487 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2021921_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Marc Hallin Author-X-Name-First: Marc Author-X-Name-Last: Hallin Author-Name: Daniel Hlubinka Author-X-Name-First: Daniel Author-X-Name-Last: Hlubinka Author-Name: Šárka Hudecová Author-X-Name-First: Šárka Author-X-Name-Last: Hudecová Title: Efficient Fully Distribution-Free Center-Outward Rank Tests for Multiple-Output Regression and MANOVA Abstract: Extending rank-based inference to a multivariate setting such as multiple-output regression or MANOVA with unspecified d-dimensional error density has remained an open problem for more than half a century. None of the many solutions proposed so far enjoys the combination of distribution-freeness and efficiency that makes rank-based inference a successful tool in the univariate setting. A concept of center-outward multivariate ranks and signs based on measure transportation ideas has been introduced recently. Center-outward ranks and signs are not only distribution-free but achieve in dimension d > 1 the (essential) maximal ancillarity property of traditional univariate ranks. In the present case, we show that fully distribution-free testing procedures based on center-outward ranks can achieve parametric efficiency. We establish the Hájek representation and asymptotic normality results required in the construction of such tests in multiple-output regression and MANOVA models. Simulations and an empirical study demonstrate the excellent performance of the proposed procedures.
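The distribution-freeness that the Hallin, Hlubinka, and Hudecová abstract above extends to dimension d > 1 is easy to see in the univariate case, where center-outward ranks reduce to ordinary ranks. A minimal Python sketch (an illustration of the univariate baseline, not the authors' multivariate procedure) checks that a rank-sum test has the same null rejection rate under very different continuous densities:

import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
for dist in (rng.standard_normal, rng.standard_cauchy):
    # two-sample null: both groups drawn from the same distribution
    pvals = [ranksums(dist(30), dist(30)).pvalue for _ in range(2000)]
    print(np.mean(np.array(pvals) < 0.05))   # close to 0.05 for both densities

Because the test depends on the data only through ranks, its null distribution does not change between the Gaussian and the heavy-tailed Cauchy case; this is the property the center-outward construction preserves in the multivariate setting.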
Journal: Journal of the American Statistical Association Pages: 1923-1939 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2021921 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2021921 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1923-1939 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2027776_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Chengchun Shi Author-X-Name-First: Chengchun Author-X-Name-Last: Shi Author-Name: Xiaoyu Wang Author-X-Name-First: Xiaoyu Author-X-Name-Last: Wang Author-Name: Shikai Luo Author-X-Name-First: Shikai Author-X-Name-Last: Luo Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Author-Name: Jieping Ye Author-X-Name-First: Jieping Author-X-Name-Last: Ye Author-Name: Rui Song Author-X-Name-First: Rui Author-X-Name-Last: Song Title: Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework Abstract: A/B testing, or online experimentation, is a standard business strategy for comparing a new product with an old one in the pharmaceutical, technological, and traditional industries. Major challenges arise in online experiments of two-sided marketplace platforms (e.g., Uber) where there is only one unit that receives a sequence of treatments over time. In those experiments, the treatment at a given time impacts the current outcome as well as future outcomes. The aim of this article is to introduce a reinforcement learning framework for carrying out A/B testing in these experiments, while characterizing the long-term treatment effects. Our proposed testing procedure allows for sequential monitoring and online updating. It is generally applicable to a variety of treatment designs in different industries. In addition, we systematically investigate the theoretical properties (e.g., size and power) of our testing procedure. Finally, we apply our framework to both simulated data and a real-world data example obtained from a technological company to illustrate its advantage over the current practice. A Python implementation of our test is available at https://github.com/callmespring/CausalRL. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2059-2071 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2022.2027776 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2027776 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2059-2071 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2024836_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Pierre Alquier Author-X-Name-First: Pierre Author-X-Name-Last: Alquier Author-Name: Badr-Eddine Chérief-Abdellatif Author-X-Name-First: Badr-Eddine Author-X-Name-Last: Chérief-Abdellatif Author-Name: Alexis Derumigny Author-X-Name-First: Alexis Author-X-Name-Last: Derumigny Author-Name: Jean-David Fermanian Author-X-Name-First: Jean-David Author-X-Name-Last: Fermanian Title: Estimation of Copulas via Maximum Mean Discrepancy Abstract: This article deals with robust inference for parametric copula models. Estimation using canonical maximum likelihood might be unstable, especially in the presence of outliers.
We propose to use a procedure based on the maximum mean discrepancy (MMD) principle. We derive nonasymptotic oracle inequalities, consistency, and asymptotic normality of this new estimator. In particular, the oracle inequality holds without any assumption on the copula family, and can be applied in the presence of outliers or under misspecification. Moreover, in our MMD framework, the statistical inference of copula models for which there exists no density with respect to the Lebesgue measure on [0,1]^d, such as the Marshall-Olkin copula, becomes feasible. A simulation study shows the robustness of our new procedures, especially compared to pseudo-maximum likelihood estimation. An R package implementing the MMD estimator for copula models is available. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1997-2012 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2024836 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2024836 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1997-2012 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2008944_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20 Author-Name: Haiqiang Ma Author-X-Name-First: Haiqiang Author-X-Name-Last: Ma Author-Name: Jiming Jiang Author-X-Name-First: Jiming Author-X-Name-Last: Jiang Title: Pseudo-Bayesian Classified Mixed Model Prediction Abstract: We propose a new classified mixed model prediction (CMMP) procedure, called pseudo-Bayesian CMMP, that uses network information in matching the group index between the training data and new data, whose characteristics of interest one wishes to predict. The current CMMP procedures do not incorporate such information; as a result, the methods are not consistent in terms of matching the group index. Although, as the number of training data groups increases, the current CMMP method can predict the mixed effects of interest consistently, its accuracy is not guaranteed when the number of groups is moderate, as is the case in many potential applications. The proposed pseudo-Bayesian CMMP procedure assumes a flexible working probability model for the group index of the new observation to match the index of a training data group, which may be viewed as a pseudo prior. We show that, given any working model satisfying mild conditions, the pseudo-Bayesian CMMP procedure is consistent and asymptotically optimal both in terms of matching the group index and in terms of predicting the mixed effect of interest associated with the new observations. The theoretical results are fully supported by results of empirical studies, including Monte Carlo simulations and real-data validation. Journal: Journal of the American Statistical Association Pages: 1747-1759 Issue: 543 Volume: 118 Year: 2023 Month: 7 X-DOI: 10.1080/01621459.2021.2008944 File-URL: http://hdl.handle.net/10.1080/01621459.2021.2008944 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1747-1759 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2044824_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Zhuoran Yang Author-X-Name-First: Zhuoran Author-X-Name-Last: Yang Author-Name: Mengxin Yu Author-X-Name-First: Mengxin Author-X-Name-Last: Yu Title: Understanding Implicit Regularization in Over-Parameterized Single Index Model Abstract: In this article, we leverage over-parameterization to design regularization-free algorithms for the high-dimensional single index model and provide theoretical guarantees for the induced implicit regularization phenomenon. Specifically, we study both vector and matrix single index models where the link function is nonlinear and unknown, the signal parameter is either a sparse vector or a low-rank symmetric matrix, and the response variable can be heavy-tailed. To gain a better understanding of the role played by implicit regularization without excess technicality, we assume that the distribution of the covariates is known a priori. For both the vector and matrix settings, we construct an over-parameterized least-squares loss function by employing the score function transform and a robust truncation step designed specifically for heavy-tailed data. We propose to estimate the true parameter by applying regularization-free gradient descent to the loss function. When the initialization is close to the origin and the stepsize is sufficiently small, we prove that the obtained solution achieves minimax optimal statistical rates of convergence in both the vector and matrix cases. In addition, our experimental results support our theoretical findings and also demonstrate that our methods empirically outperform classical methods with explicit regularization in terms of both l2-statistical rate and variable selection consistency. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2315-2328 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2044824 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2044824 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2315-2328 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2061354_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Xiufan Yu Author-X-Name-First: Xiufan Author-X-Name-Last: Yu Author-Name: Danning Li Author-X-Name-First: Danning Author-X-Name-Last: Li Author-Name: Lingzhou Xue Author-X-Name-First: Lingzhou Author-X-Name-Last: Xue Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Title: Power-Enhanced Simultaneous Test of High-Dimensional Mean Vectors and Covariance Matrices with Application to Gene-Set Testing Abstract: Power-enhanced tests with high-dimensional data have received growing attention in theoretical and applied statistics in recent years. Existing tests possess their respective high-power regions, and we may lack prior knowledge about the alternatives when testing for a problem of interest in practice. There is a critical need for developing powerful testing procedures against more general alternatives.
This article studies the joint test of two-sample mean vectors and covariance matrices for high-dimensional data. We first expand the high-power regions of high-dimensional mean tests or covariance tests to a wider alternative space and then combine their strengths together in the simultaneous test. We develop a new power-enhanced simultaneous test that is powerful in detecting differences in either mean vectors or covariance matrices under either sparse or dense alternatives. We prove that the proposed testing procedures align with the power enhancement principles introduced by Fan, Liao, and Yao and achieve accurate asymptotic size and consistent asymptotic power. We demonstrate the finite-sample performance using simulation studies and a real application to find differentially expressed gene-sets in cancer studies. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2548-2561 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2061354 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2061354 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2548-2561 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2070070_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Baihua He Author-X-Name-First: Baihua Author-X-Name-Last: He Author-Name: Shuangge Ma Author-X-Name-First: Shuangge Author-X-Name-Last: Ma Author-Name: Xinyu Zhang Author-X-Name-First: Xinyu Author-X-Name-Last: Zhang Author-Name: Li-Xing Zhu Author-X-Name-First: Li-Xing Author-X-Name-Last: Zhu Title: Rank-Based Greedy Model Averaging for High-Dimensional Survival Data Abstract: Model averaging is an effective way to enhance prediction accuracy. However, most previous works focus on low-dimensional settings with completely observed responses. To attain an accurate prediction for the risk effect of survival data with high-dimensional predictors, we propose a novel method: rank-based greedy (RG) model averaging. Specifically, adopting the transformation model with splitting predictors as working models, we doubly use the smooth concordance index function to derive the candidate predictions and optimal model weights. The final prediction is achieved by weighted averaging of all the candidates. Our approach is flexible, computationally efficient, and robust against model misspecification, as it neither requires the correctness of a joint model nor involves the estimation of the transformation function. We further adopt the greedy algorithm for high dimensions. Theoretically, we derive an asymptotic error bound for the optimal weights under some mild conditions. In addition, the sum of the weights assigned to the correct candidate submodels is proven to approach one in probability when correct models are included among the candidates. Extensive numerical studies are carried out using both simulated and real datasets to show the proposed approach’s robust performance compared to the existing regularization approaches. Supplementary materials for this article are available online.
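The He, Ma, Zhang, and Zhu abstract above builds its candidate predictions and weights on a smoothed concordance index. The plain (unsmoothed, uncensored) concordance index can be sketched in a few lines of Python; this is an illustration of the underlying quantity only, not the authors' smoothed criterion or greedy algorithm:

import numpy as np

def concordance_index(time, risk):
    """Fraction of comparable pairs ordered concordantly (higher risk -> shorter time)."""
    n, num, den = len(time), 0, 0
    for i in range(n):
        for j in range(n):
            if time[i] < time[j]:      # pair is comparable (no censoring in this sketch)
                den += 1
                num += risk[i] > risk[j]
    return num / den

rng = np.random.default_rng(1)
x = rng.standard_normal(200)
t = np.exp(-x + 0.5 * rng.standard_normal(200))   # higher risk score -> shorter survival
print(concordance_index(t, x))                    # well above 0.5 for a useful score

Replacing the indicator risk[i] > risk[j] with a sigmoid of the risk difference gives the smooth version that makes the criterion differentiable and hence usable for deriving weights.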
Journal: Journal of the American Statistical Association Pages: 2658-2670 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2070070 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2070070 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2658-2670 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2044827_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Yanyan Zeng Author-X-Name-First: Yanyan Author-X-Name-Last: Zeng Author-Name: Daolin Pang Author-X-Name-First: Daolin Author-X-Name-Last: Pang Author-Name: Hongyu Zhao Author-X-Name-First: Hongyu Author-X-Name-Last: Zhao Author-Name: Tao Wang Author-X-Name-First: Tao Author-X-Name-Last: Wang Title: A Zero-Inflated Logistic Normal Multinomial Model for Extracting Microbial Compositions Abstract: High-throughput sequencing data collected to study the microbiome provide information in the form of relative abundances and should be treated as compositions. Although many approaches, including scaling and rarefaction, have been proposed for converting raw count data into microbial compositions, most of these methods simply return zero values for zero counts. However, zeros can distort downstream analyses, and they can also pose problems for composition-aware methods. This problem is exacerbated with microbiome abundance data because they are sparse with excessive zeros. In addition to data sparsity, microbial composition estimation depends on other data characteristics such as high dimensionality, over-dispersion, and complex co-occurrence relationships. To address these challenges, we introduce a zero-inflated probabilistic PCA (ZIPPCA) model that accounts for the compositional nature of microbiome data, and propose an empirical Bayes approach to estimate microbial compositions. An efficient iterative algorithm, called classification variational approximation, is developed for carrying out maximum likelihood estimation. Moreover, we study the consistency and asymptotic normality of the variational approximation estimator from the perspective of profile M-estimation. Extensive simulations and an application to a dataset from the Human Microbiome Project are presented to compare the performance of the proposed method with that of the existing methods. The method is implemented in R and available at https://github.com/YanyZeng/ZIPPCAlnm. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2356-2369 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2044827 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2044827 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2356-2369 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2089574_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Yinyin Chen Author-X-Name-First: Yinyin Author-X-Name-Last: Chen Author-Name: Shishuang He Author-X-Name-First: Shishuang Author-X-Name-Last: He Author-Name: Yun Yang Author-X-Name-First: Yun Author-X-Name-Last: Yang Author-Name: Feng Liang Author-X-Name-First: Feng Author-X-Name-Last: Liang Title: Learning Topic Models: Identifiability and Finite-Sample Analysis Abstract: Topic models provide a useful text-mining tool for learning, extracting, and discovering latent structures in large text corpora. Although a plethora of methods have been proposed for topic modeling, lacking in the literature is a formal theoretical investigation of the statistical identifiability and accuracy of latent topic estimation. In this article, we propose a maximum likelihood estimator (MLE) of latent topics based on a specific integrated likelihood that is naturally connected to the concept, in computational geometry, of volume minimization. Our theory introduces a new set of geometric conditions for topic model identifiability, conditions that are weaker than conventional separability conditions, which typically rely on the existence of pure topic documents or of anchor words. Weaker conditions allow a wider and thus potentially more fruitful investigation. We conduct finite-sample error analysis for the proposed estimator and discuss connections between our results and those of previous investigations. We conclude with empirical studies employing both simulated and real datasets. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2860-2875 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2089574 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2089574 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2860-2875 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2208388_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Moritz Berger Author-X-Name-First: Moritz Author-X-Name-Last: Berger Author-Name: Ana Kowark Author-X-Name-First: Ana Author-X-Name-Last: Kowark Author-Name: Rolf Rossaint Author-X-Name-First: Rolf Author-X-Name-Last: Rossaint Author-Name: Mark Coburn Author-X-Name-First: Mark Author-X-Name-Last: Coburn Author-Name: Matthias Schmid Author-X-Name-First: Matthias Author-X-Name-Last: Schmid Title: Modeling Postoperative Mortality in Older Patients by Boosting Discrete-Time Competing Risks Models Abstract: Elderly patients are at high risk of postoperative death. Personalized strategies to improve their recovery after intervention are therefore urgently needed. A popular way to analyze postoperative mortality is to develop a prognostic model that incorporates risk factors measured at hospital admission, for example, comorbidities. When building such models, numerous issues must be addressed, including censoring and the presence of competing events (such as discharge from hospital alive). Here we present a novel survival modeling approach to investigate 30-day inpatient mortality following intervention.
The proposed method accounts for both grouped event times (e.g., measured in 24-hour intervals) and competing events. Conceptually, the method is embedded in the framework of generalized additive models for location, scale, and shape (GAMLSS). Model fitting is performed using a component-wise gradient boosting algorithm, which allows for additional regularization steps via stability selection. We used this new modeling approach to analyze data from the Peri-interventional Outcome Study in the Elderly (POSE), which is a recent cohort study that enrolled 9862 elderly inpatients undergoing intervention under anesthesia. Application of the proposed boosting algorithm yielded six important risk factors (including both clinical variables and interventional characteristics) that contributed either to the hazard of death or to discharge from hospital alive. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2239-2249 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2023.2208388 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2208388 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2239-2249 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2096619_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Samuel Perreault Author-X-Name-First: Samuel Author-X-Name-Last: Perreault Author-Name: Johanna G. Nešlehová Author-X-Name-First: Johanna G. Author-X-Name-Last: Nešlehová Author-Name: Thierry Duchesne Author-X-Name-First: Thierry Author-X-Name-Last: Duchesne Title: Hypothesis Tests for Structured Rank Correlation Matrices Abstract: Joint modeling of a large number of variables often requires dimension reduction strategies that lead to structural assumptions of the underlying correlation matrix, such as equal pairwise correlations within subsets of variables. The underlying correlation matrix is thus of interest for both model specification and model validation. In this article, we develop tests of the hypothesis that the entries of the Kendall rank correlation matrix are linear combinations of a smaller number of parameters. The asymptotic behavior of the proposed test statistics is investigated both when the dimension is fixed and when it grows with the sample size. We pay special attention to the restricted hypothesis of partial exchangeability, which contains full exchangeability as a special case. We show that under partial exchangeability, the test statistics and their large-sample distributions simplify, which leads to computational advantages and better performance of the tests. We propose various scalable numerical strategies for implementation of the proposed procedures, investigate their behavior through simulations and power calculations under local alternatives, and demonstrate their use on a real dataset of mean sea levels at various geographical locations. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2889-2900 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2096619 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2096619 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
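The object tested in the Perreault, Nešlehová, and Duchesne abstract above is the empirical Kendall rank correlation matrix. A minimal Python sketch of that matrix follows (the equicorrelated data-generating process is illustrative, and this computes only the matrix itself, not the authors' structured tests):

import numpy as np
from scipy.stats import kendalltau

def kendall_matrix(X):
    # pairwise Kendall's tau between all columns of X
    p = X.shape[1]
    T = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            tau, _ = kendalltau(X[:, i], X[:, j])
            T[i, j] = T[j, i] = tau
    return T

rng = np.random.default_rng(5)
z = rng.standard_normal((300, 1))                # shared factor
X = z + rng.standard_normal((300, 4))            # equicorrelated block of 4 variables
print(kendall_matrix(X).round(2))                # off-diagonal entries roughly equal

Under full exchangeability all off-diagonal entries share one parameter, which is the kind of linear structure on the matrix entries that the proposed tests assess.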
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2889-2900 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2060112_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Robin Dunn Author-X-Name-First: Robin Author-X-Name-Last: Dunn Author-Name: Larry Wasserman Author-X-Name-First: Larry Author-X-Name-Last: Wasserman Author-Name: Aaditya Ramdas Author-X-Name-First: Aaditya Author-X-Name-Last: Ramdas Title: Distribution-Free Prediction Sets for Two-Layer Hierarchical Models Abstract: We consider the problem of constructing distribution-free prediction sets for data from two-layer hierarchical distributions. For iid data, prediction sets can be constructed using the method of conformal prediction. The validity of conformal prediction hinges on the exchangeability of the data, which does not hold when groups of observations come from distinct distributions, such as multiple observations on each patient in a medical database. We extend conformal methods to a hierarchical setting. We develop CDF pooling, single subsampling, and repeated subsampling approaches to construct prediction sets in unsupervised and supervised settings. We compare these approaches in terms of coverage and average set size. If asymptotic coverage is acceptable, we recommend CDF pooling for its balance between empirical coverage and average set size. If we desire coverage guarantees, then we recommend the repeated subsampling approach. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2491-2502 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2060112 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2060112 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2491-2502 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2080682_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Nicholas C. Henderson Author-X-Name-First: Nicholas C. Author-X-Name-Last: Henderson Author-Name: Ravi Varadhan Author-X-Name-First: Ravi Author-X-Name-Last: Varadhan Author-Name: Thomas A. Louis Author-X-Name-First: Thomas A. Author-X-Name-Last: Louis Title: Improved Small Domain Estimation via Compromise Regression Weights Abstract: Shrinkage estimates of small domain parameters typically use a combination of a noisy “direct” estimate that only uses data from a specific small domain and a more stable regression estimate. When the regression model is misspecified, estimation performance for the noisier domains can suffer due to substantial shrinkage toward a poorly estimated regression surface. In this article, we introduce a new class of robust, empirically driven regression weights that target estimation of the small domain means under potential misspecification of the global regression model. Our regression weights are a convex combination of the model-based weights associated with the best linear unbiased predictor (BLUP) and those associated with the observed best predictor (OBP). The mixing parameter in this convex combination is found by minimizing a novel, unbiased estimate of the mean-squared prediction error for the small domain means, and we label the associated small domain estimates the “compromise best predictor,” or CBP.
Using a data-adaptive mixture for the regression weights enables the CBP to preserve the robustness of the OBP while retaining the main advantages of the EBLUP whenever the regression model is correct. We demonstrate the use of the CBP in an application estimating gait speed in older adults. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2793-2809 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2080682 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2080682 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2793-2809 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2057316_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Erin E. Gabriel Author-X-Name-First: Erin E. Author-X-Name-Last: Gabriel Author-Name: Michael C. Sachs Author-X-Name-First: Michael C. Author-X-Name-Last: Sachs Author-Name: Arvid Sjölander Author-X-Name-First: Arvid Author-X-Name-Last: Sjölander Title: Sharp Nonparametric Bounds for Decomposition Effects with Two Binary Mediators Abstract: In randomized trials, once the total effect of the intervention has been estimated, it is often of interest to explore mechanistic effects through mediators along the causal pathway between the randomized treatment and the outcome. In the setting with two sequential mediators, there are a variety of decompositions of the total risk difference into mediation effects. We derive sharp and valid bounds for a number of mediation effects in the setting of two sequential mediators, both with unmeasured confounding with the outcome. We provide five such bounds in the main text corresponding to two different decompositions of the total effect, as well as the controlled direct effect, with an additional 30 novel bounds provided in the supplementary materials corresponding to the terms of 24 four-way decompositions. We also show that, although it may seem that one can produce sharp bounds by adding or subtracting the limits of the sharp bounds for terms in a decomposition, this almost always produces valid, but not sharp, bounds that can even be completely noninformative. We investigate the properties of the bounds by simulating random probability distributions under our causal model and illustrate how they are interpreted in a real data example. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2446-2453 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2057316 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2057316 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2446-2453 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2075369_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Michael Guggisberg Author-X-Name-First: Michael Author-X-Name-Last: Guggisberg Title: A Bayesian Approach to Multiple-Output Quantile Regression Abstract: This article presents a Bayesian approach to multiple-output quantile regression. The prior can be elicited as ex-ante knowledge of the distance of the τ-Tukey depth contour to the Tukey median, the first prior of its kind.
The parametric model is proven to be consistent, and a procedure for obtaining confidence intervals is proposed. A proposal for nonparametric multiple-output regression is also presented. These results add to the literature on misspecified Bayesian modeling, consistency, and prior elicitation for nonparametric multivariate modeling. The model is applied to the Tennessee Project Steps to Achieving Resilience (STAR) experiment and finds a joint increase in τ-quantile subpopulations for mathematics and reading scores given a decrease in the number of students per teacher. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2736-2745 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2075369 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2075369 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2736-2745 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2060835_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Alessandro Zito Author-X-Name-First: Alessandro Author-X-Name-Last: Zito Author-Name: Tommaso Rigon Author-X-Name-First: Tommaso Author-X-Name-Last: Rigon Author-Name: Otso Ovaskainen Author-X-Name-First: Otso Author-X-Name-Last: Ovaskainen Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: Bayesian Modeling of Sequential Discoveries Abstract: We aim to model the appearance of distinct tags in a sequence of labeled objects. Common examples of this type of data include words in a corpus or distinct species in a sample. These sequential discoveries are often summarized via accumulation curves, which count the number of distinct entities observed in an increasingly large set of objects. We propose a novel Bayesian method for species sampling modeling by directly specifying the probability of a new discovery, thereby allowing for flexible specifications. The asymptotic behavior and finite sample properties of such an approach are extensively studied. Interestingly, our enlarged class of sequential processes includes highly tractable special cases. We present a subclass of models characterized by appealing theoretical and computational properties, including one that shares the same discovery probability with the Dirichlet process. Moreover, due to strong connections with logistic regression models, the latter subclass can naturally account for covariates. We finally test our proposal on both synthetic and real data, with special emphasis on a large fungal biodiversity study in Finland. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2521-2532 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2060835 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2060835 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2521-2532 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2057860_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Thomas H. Scheike Author-X-Name-First: Thomas H.
Author-X-Name-Last: Scheike Author-Name: Torben Martinussen Author-X-Name-First: Torben Author-X-Name-Last: Martinussen Author-Name: Brice Ozenne Author-X-Name-First: Brice Author-X-Name-Last: Ozenne Title: Efficient Estimation in the Fine and Gray Model Abstract: Direct regression for the cumulative incidence function (CIF) has become increasingly popular since the Fine and Gray model was suggested, due to its more direct interpretation on the probability risk scale. We here consider estimation within the Fine and Gray model using the theory of semiparametric efficient estimation. We show that the Fine and Gray estimator is semiparametrically efficient in the case without censoring. In the case of right-censored data, however, we show that the Fine and Gray estimator is no longer semiparametrically efficient and derive the semiparametrically efficient estimator. This estimation approach involves complicated integral equations, and we therefore also derive a simpler estimator as an augmented version of the Fine and Gray estimator with respect to the censoring nuisance space. While the augmentation term involves the CIF of the competing risk, it also leads to a robustness property: the proposed estimators remain consistent even if one of the models for the censoring mechanism or the CIF of the competing risk is misspecified. We illustrate this robustness property using simulation studies, comparing the Fine–Gray estimator and its augmented version. When the competing cause has a high cumulative incidence we see a substantial gain in efficiency from adding the augmentation term with a very reasonable computation time. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2482-2490 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2057860 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2057860 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2482-2490 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2079514_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Peiliang Bai Author-X-Name-First: Peiliang Author-X-Name-Last: Bai Author-Name: Abolfazl Safikhani Author-X-Name-First: Abolfazl Author-X-Name-Last: Safikhani Author-Name: George Michailidis Author-X-Name-First: George Author-X-Name-Last: Michailidis Title: Multiple Change Point Detection in Reduced Rank High Dimensional Vector Autoregressive Models Abstract: We study the problem of detecting and locating change points in high-dimensional Vector Autoregressive (VAR) models, whose transition matrices exhibit low rank plus sparse structure. We first address the problem of detecting a single change point using an exhaustive search algorithm and establish a finite sample error bound for its accuracy. Next, we extend the results to the case of multiple change points that can grow as a function of the sample size. Their detection is based on a two-step algorithm, wherein, in the first step, an exhaustive search for a candidate change point is employed for overlapping windows, and subsequently a backward elimination procedure is used to screen out redundant candidates. The two-step strategy yields consistent estimates of the number and the locations of the change points.
To reduce computation cost, we also investigate conditions under which a surrogate VAR model with a weakly sparse transition matrix can accurately estimate the change points and their locations for data generated by the original model. This work also addresses and resolves a number of novel technical challenges posed by the nature of the VAR models under consideration. The effectiveness of the proposed algorithms and methodology is illustrated on both synthetic data and two real datasets. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2776-2792 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2079514 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2079514 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2776-2792 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2262009_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Alan Agresti Author-X-Name-First: Alan Author-X-Name-Last: Agresti Title: Confidence Intervals for Discrete Data in Clinical Research Journal: Journal of the American Statistical Association Pages: 2945-2945 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2023.2262009 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2262009 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2945-2945 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2078331_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Sze Ming Lee Author-X-Name-First: Sze Ming Author-X-Name-Last: Lee Author-Name: Tony Sit Author-X-Name-First: Tony Author-X-Name-Last: Sit Author-Name: Gongjun Xu Author-X-Name-First: Gongjun Author-X-Name-Last: Xu Title: Efficient Estimation for Censored Quantile Regression Abstract: Censored quantile regression (CQR) has received growing attention in survival analysis because of its flexibility in modeling heterogeneous effects of covariates. Advances have been made in developing various inferential procedures under different assumptions and settings. Under the conditional independence assumption, many existing CQR methods can be characterized either by stochastic integral-based estimating equations (see, e.g., Peng and Huang) or by locally weighted approaches to adjust for the censored observations (see, for instance, Wang and Wang). While a variety of apparently dissimilar strategies, in terms of both formulation and technique, have been proposed for CQR, the interrelationships amongst these methods are rarely discussed in the literature. In addition, given the complicated structure of the asymptotic variance, there has been limited investigation into improving the estimation efficiency for censored quantile regression models. This article addresses these open questions by proposing a unified framework under which many conventional approaches for CQR are covered as special cases. The new formulation also facilitates the construction of the most efficient estimator for the parameters of interest amongst a general class of estimating functions. Asymptotic properties including consistency and weak convergence of the proposed estimator are established via the martingale-based argument.
Numerical studies are presented to illustrate the promising performance of the proposed estimator as compared to existing contenders under various settings. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2762-2775 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2078331 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2078331 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2762-2775 Template-Type: ReDIF-Article 1.0 Author-Name: Jacob Dorn Author-X-Name-First: Jacob Author-X-Name-Last: Dorn Author-Name: Kevin Guo Author-X-Name-First: Kevin Author-X-Name-Last: Guo Title: Sharp Sensitivity Analysis for Inverse Propensity Weighting via Quantile Balancing Abstract: Inverse propensity weighting (IPW) is a popular method for estimating treatment effects from observational data. However, its correctness relies on the untestable (and frequently implausible) assumption that all confounders have been measured. This article introduces a robust sensitivity analysis for IPW that estimates the range of treatment effects compatible with a given amount of unobserved confounding. The estimated range converges to the narrowest possible interval (under the given assumptions) that must contain the true treatment effect. Our proposal is a refinement of the influential sensitivity analysis by Zhao, Small, and Bhattacharya, which we show gives bounds that are too wide even asymptotically. This analysis is based on new partial identification results for Tan’s marginal sensitivity model. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2645-2657 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2069572 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2069572 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2645-2657 Template-Type: ReDIF-Article 1.0 Author-Name: Clément Cerovecki Author-X-Name-First: Clément Author-X-Name-Last: Cerovecki Author-Name: Vaidotas Characiejus Author-X-Name-First: Vaidotas Author-X-Name-Last: Characiejus Author-Name: Siegfried Hörmann Author-X-Name-First: Siegfried Author-X-Name-Last: Hörmann Title: The Maximum of the Periodogram of a Sequence of Functional Data Abstract: We study the periodogram operator of a sequence of functional data. Using recent advances in Gaussian approximation theory, we derive the asymptotic distribution of the maximum norm over all fundamental frequencies. We consider the case where the noise variables are independent and then generalize our results to functional linear processes. Our theory can be used for detecting periodic signals in functional time series when the length of the period is unknown. We demonstrate the proposed methodology in a simulation study as well as on real data. Supplementary materials for this article are available online. 
Journal: Journal of the American Statistical Association Pages: 2712-2720 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2071720 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2071720 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2712-2720 Template-Type: ReDIF-Article 1.0 Author-Name: Nir Keret Author-X-Name-First: Nir Author-X-Name-Last: Keret Author-Name: Malka Gorfine Author-X-Name-First: Malka Author-X-Name-Last: Gorfine Title: Analyzing Big EHR Data—Optimal Cox Regression Subsampling Procedure with Rare Events Abstract: Massive survival datasets have become increasingly prevalent with the development of the healthcare industry and pose computational challenges unprecedented in traditional survival-analysis use cases. In this work we analyze the UK Biobank colorectal cancer data with genetic and environmental risk factors, including a time-dependent coefficient, which transforms the dataset into “pseudo-observation” form and thus critically inflates its size. A popular way of coping with massive datasets is to downsample them so that the analysis can be carried out with the computational resources available to the researcher. Cox regression has remained one of the most popular statistical models for the analysis of survival data to date. This work addresses the settings of right-censored and possibly left-truncated data with rare events, such that the observed failure times constitute only a small portion of the overall sample. We propose Cox regression subsampling-based estimators that approximate their full-data partial-likelihood-based counterparts, by assigning optimal sampling probabilities to censored observations, and including all observed failures in the analysis. The suggested methodology is applied to the UK Biobank data for building a colorectal cancer risk-prediction model, while reducing the computation time and memory requirements. Asymptotic properties of the proposed estimators are established under suitable regularity conditions, and simulation studies are carried out to evaluate their finite sample performance. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2262-2275 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2023.2209349 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2209349 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2262-2275 Template-Type: ReDIF-Article 1.0 Author-Name: George G. Vega Yon Author-X-Name-First: George G. Author-X-Name-Last: Vega Yon Title: Power and Multicollinearity in Small Networks: A Discussion of “Tale of Two Datasets: Representativeness and Generalisability of Inference for Samples of Networks” by Krivitsky, Coletti, and Hens Journal: Journal of the American Statistical Association Pages: 2228-2231 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2023.2252041 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2252041 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2228-2231 Template-Type: ReDIF-Article 1.0 Author-Name: Jing Lei Author-X-Name-First: Jing Author-X-Name-Last: Lei Author-Name: Kevin Z. Lin Author-X-Name-First: Kevin Z. Author-X-Name-Last: Lin Title: Bias-Adjusted Spectral Clustering in Multi-Layer Stochastic Block Models Abstract: We consider the problem of estimating common community structures in multi-layer stochastic block models, where each single layer may not have sufficient signal strength to recover the full community structure. In order to efficiently aggregate signal across different layers, we argue that the sum of squared adjacency matrices contains sufficient signal even when individual layers are very sparse. Our method uses a bias-removal step that is necessary when the squared noise matrices may overwhelm the signal in the very sparse regime. The analysis of our method relies on several novel tail probability bounds for matrix linear combinations with matrix-valued coefficients and matrix-valued quadratic forms, which may be of independent interest. The performance of our method and the necessity of bias removal are demonstrated on synthetic data and in a microarray analysis of gene co-expression networks. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2433-2445 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2054817 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2054817 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2433-2445 Template-Type: ReDIF-Article 1.0 Author-Name: Nynke M. D. Niezink Author-X-Name-First: Nynke M. D. Author-X-Name-Last: Niezink Title: Discussion of “A Tale of Two Datasets: Representativeness and Generalisability of Inference for Samples of Networks” Journal: Journal of the American Statistical Association Pages: 2232-2234 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2023.2257267 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2257267 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2232-2234 Template-Type: ReDIF-Article 1.0 Author-Name: David T. Frazier Author-X-Name-First: David T. Author-X-Name-Last: Frazier Author-Name: David J. Nott Author-X-Name-First: David J. Author-X-Name-Last: Nott Author-Name: Christopher Drovandi Author-X-Name-First: Christopher Author-X-Name-Last: Drovandi Author-Name: Robert Kohn Author-X-Name-First: Robert Author-X-Name-Last: Kohn Title: Bayesian Inference Using Synthetic Likelihood: Asymptotics and Adjustments Abstract: Implementing Bayesian inference is often computationally challenging in complex models, especially when calculating the likelihood is difficult. Synthetic likelihood is one approach for carrying out inference when the likelihood is intractable but it is straightforward to simulate from the model. 
The method constructs an approximate likelihood by treating a vector summary statistic as multivariate normal, with the unknown mean and covariance estimated by simulation. Previous research demonstrates that the Bayesian implementation of synthetic likelihood can be more computationally efficient than approximate Bayesian computation, a popular likelihood-free method, in the presence of a high-dimensional summary statistic. Our article makes three contributions. The first shows that if the summary statistics are well-behaved, then the synthetic likelihood posterior is asymptotically normal and yields credible sets with the correct level of coverage. The second contribution compares the computational efficiency of Bayesian synthetic likelihood and approximate Bayesian computation. We show that Bayesian synthetic likelihood is computationally more efficient than approximate Bayesian computation. Based on the asymptotic results, the third contribution proposes using adjusted inference methods when a possibly misspecified form is assumed for the covariance matrix of the synthetic likelihood, such as a diagonal matrix or a factor model, to speed up computation. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2821-2832 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2086132 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2086132 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2821-2832 Template-Type: ReDIF-Article 1.0 Author-Name: Chenguang Dai Author-X-Name-First: Chenguang Author-X-Name-Last: Dai Author-Name: Buyu Lin Author-X-Name-First: Buyu Author-X-Name-Last: Lin Author-Name: Xin Xing Author-X-Name-First: Xin Author-X-Name-Last: Xing Author-Name: Jun S. Liu Author-X-Name-First: Jun S. Author-X-Name-Last: Liu Title: False Discovery Rate Control via Data Splitting Abstract: Selecting relevant features associated with a given response variable is an important problem in many scientific fields. Quantifying the quality and uncertainty of a selection result via false discovery rate (FDR) control has been of recent interest. This article introduces a data-splitting method (referred to as “DS”) to asymptotically control the FDR while maintaining a high power. For each feature, DS constructs a test statistic by estimating two independent regression coefficients via data splitting. FDR control is achieved by taking advantage of the statistic’s property that, for any null feature, its sampling distribution is symmetric about zero, whereas for a relevant feature, its sampling distribution has a positive mean. Furthermore, a Multiple Data Splitting (MDS) method is proposed to stabilize the selection result and boost the power. Surprisingly, with the FDR under control, MDS not only helps overcome the power loss caused by data splitting, but also results in a lower variance of the false discovery proportion (FDP) compared with all other methods under consideration. 
Extensive simulation studies and a real-data application show that the proposed methods are robust to the unknown distribution of features, easy to implement and computationally efficient, and often the most powerful among competitors, especially when the signals are weak and correlations or partial correlations among features are high. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2503-2520 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2060113 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2060113 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2503-2520 Template-Type: ReDIF-Article 1.0 Author-Name: Marco Avarucci Author-X-Name-First: Marco Author-X-Name-Last: Avarucci Author-Name: Paolo Zaffaroni Author-X-Name-First: Paolo Author-X-Name-Last: Zaffaroni Title: Robust Estimation of Large Panels with Factor Structures Abstract: This article studies estimation of linear panel regression models with heterogeneous coefficients using a class of weighted least squares estimators, when both the regressors and the error possibly contain a common latent factor structure. Our theory is robust to the specification of such a factor structure because it does not require any information on the number of factors or estimation of the factor structure itself. Moreover, our method is, in certain circumstances, efficient because it nests the GLS principle. We first show how our infeasible weighted estimator provides a bias-adjusted estimator with the conventional limiting distribution, for situations in which the OLS is affected by a first-order bias. The technical challenge resolved in the article consists of showing how these properties are preserved for the feasible weighted estimator in a double-asymptotics setting. Our theory is illustrated by extensive Monte Carlo experiments and an empirical application that investigates the link between capital accumulation and economic growth in an international setting. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2394-2405 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2050244 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2050244 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2394-2405 Template-Type: ReDIF-Article 1.0 Author-Name: Yao Zhang Author-X-Name-First: Yao Author-X-Name-Last: Zhang Author-Name: Qingyuan Zhao Author-X-Name-First: Qingyuan Author-X-Name-Last: Zhao Title: What is a Randomization Test? Abstract: The meaning of randomization tests has become obscure in statistics education and practice over the last century. This article makes a fresh attempt at rectifying this core concept of statistics. A new term—“quasi-randomization test”—is introduced to define significance tests based on theoretical models and distinguish these tests from the “randomization tests” based on the physical act of randomization. 
The practical importance of this distinction is illustrated through a real stepped-wedge cluster-randomized trial. Building on the recent literature on randomization inference, a general framework of conditional randomization tests is developed and some practical methods to construct conditioning events are given. The proposed terminology and framework are then applied to understand several widely used (quasi-)randomization tests, including Fisher’s exact test, permutation tests for treatment effect, quasi-randomization tests for independence and conditional independence, adaptive randomization, and conformal prediction. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2928-2942 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2023.2199814 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2199814 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2928-2942 Template-Type: ReDIF-Article 1.0 Author-Name: Peter Kramlinger Author-X-Name-First: Peter Author-X-Name-Last: Kramlinger Author-Name: Tatyana Krivobokova Author-X-Name-First: Tatyana Author-X-Name-Last: Krivobokova Author-Name: Stefan Sperlich Author-X-Name-First: Stefan Author-X-Name-Last: Sperlich Title: Marginal and Conditional Multiple Inference for Linear Mixed Model Predictors Abstract: In spite of its high practical relevance, cluster-specific multiple inference for linear mixed model predictors has hardly been addressed so far. While marginal inference for population parameters is well understood, conditional inference for the cluster-specific predictors is more intricate. This work introduces a general framework for multiple inference in linear mixed models for cluster-specific predictors. Consistent confidence sets for multiple inference are constructed under both the marginal and the conditional law. Furthermore, it is shown that, remarkably, the corresponding multiple marginal confidence sets are also asymptotically valid for conditional inference. These lend themselves to testing linear hypotheses using standard quantiles without the need for resampling techniques. All findings are validated in simulations and illustrated with a study of Covid-19 mortality in U.S. state prisons. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2344-2355 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2044826 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2044826 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2344-2355 Template-Type: ReDIF-Article 1.0 Author-Name: Ting Ye Author-X-Name-First: Ting Author-X-Name-Last: Ye Author-Name: Jun Shao Author-X-Name-First: Jun Author-X-Name-Last: Shao Author-Name: Yanyao Yi Author-X-Name-First: Yanyao Author-X-Name-Last: Yi Author-Name: Qingyuan Zhao Author-X-Name-First: Qingyuan Author-X-Name-Last: Zhao Title: Toward Better Practice of Covariate Adjustment in Analyzing Randomized Clinical Trials Abstract: In randomized clinical trials, adjustments for baseline covariates at both design and analysis stages are highly encouraged by regulatory agencies. A recent trend is to use a model-assisted approach for covariate adjustment to gain credibility and efficiency while producing asymptotically valid inference even when the model is incorrect. In this article, we present three considerations for better practice when model-assisted inference is applied to adjust for covariates under simple or covariate-adaptive randomized trials: (a) guaranteed efficiency gain: a model-assisted method should often gain but never hurt efficiency; (b) wide applicability: a valid procedure should be applicable, and preferably universally applicable, to all commonly used randomization schemes; (c) robust standard error: variance estimation should be robust to model misspecification and heteroscedasticity. To achieve these, we recommend a model-assisted estimator under an analysis of heterogeneous covariance working model that includes all covariates used in randomization. Our conclusions are based on an asymptotic theory that provides a clear picture of how covariate-adaptive randomization and regression adjustment alter statistical efficiency. Our theory is more general than the existing ones in terms of studying arbitrary functions of response means (including linear contrasts, ratios, and odds ratios), multiple arms, guaranteed efficiency gain, optimality, and universal applicability. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2370-2382 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2049278 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2049278 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2370-2382 Template-Type: ReDIF-Article 1.0 Author-Name: Yichi Zhang Author-X-Name-First: Yichi Author-X-Name-Last: Zhang Author-Name: Weining Shen Author-X-Name-First: Weining Author-X-Name-Last: Shen Author-Name: Dehan Kong Author-X-Name-First: Dehan Author-X-Name-Last: Kong Title: Covariance Estimation for Matrix-valued Data Abstract: Covariance estimation for matrix-valued data has received increasing interest in applications. Unlike previous works that rely heavily on the matrix normal distribution assumption and the requirement of a fixed matrix size, we propose a class of distribution-free regularized covariance estimation methods for high-dimensional matrix data under a separability condition and a bandable covariance structure. 
Under these conditions, the original covariance matrix is decomposed into a Kronecker product of two bandable small covariance matrices representing the variability over the row and column directions. We formulate a unified framework for estimating bandable covariance, and introduce an efficient algorithm based on rank-one unconstrained Kronecker product approximation. The convergence rates of the proposed estimators are established, and the derived minimax lower bound shows that our proposed estimator is rate-optimal under certain divergence regimes of matrix size. We further introduce a class of robust covariance estimators and provide theoretical guarantees to deal with heavy-tailed data. We demonstrate the superior finite-sample performance of our methods using simulations and real applications from a gridded temperature anomalies dataset and an S&P 500 stock data analysis. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2620-2631 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2068419 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2068419 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2620-2631 Template-Type: ReDIF-Article 1.0 Author-Name: Ziyi Li Author-X-Name-First: Ziyi Author-X-Name-Last: Li Author-Name: Yu Shen Author-X-Name-First: Yu Author-X-Name-Last: Shen Author-Name: Jing Ning Author-X-Name-First: Jing Author-X-Name-Last: Ning Title: Accommodating Time-Varying Heterogeneity in Risk Estimation under the Cox Model: A Transfer Learning Approach Abstract: Transfer learning has attracted increasing attention in recent years for adaptively borrowing information across different data cohorts in various settings. Cancer registries have been widely used in clinical research because of their easy accessibility and large sample size. Our method is motivated by the question of how to use cancer registry data as a complement to improve the estimation precision of individual risks of death for inflammatory breast cancer (IBC) patients at The University of Texas MD Anderson Cancer Center. When transferring information for risk estimation based on the cancer registries (i.e., source cohort) to a single cancer center (i.e., target cohort), time-varying population heterogeneity needs to be appropriately acknowledged. However, there is no literature on how to adaptively transfer knowledge on risk estimation with time-to-event data from the source cohort to the target cohort while adjusting for time-varying differences in event risks between the two sources. Our goal is to address this statistical challenge by developing a transfer learning approach under the Cox proportional hazards model. To allow data-adaptive levels of information borrowing, we impose Lasso penalties on the discrepancies in regression coefficients and baseline hazard functions between the two cohorts, which are jointly solved in the proposed transfer learning algorithm. As shown in the extensive simulation studies, the proposed method yields more precise individualized risk estimation than using the target cohort alone. Meanwhile, our method demonstrates satisfactory robustness against cohort differences compared with the method that directly combines the target and source data in the Cox model. 
We develop a more accurate risk estimation model for the MD Anderson IBC cohort given various treatment and baseline covariates, while adaptively borrowing information from the National Cancer Database to improve risk assessment. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2276-2287 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2023.2210336 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2210336 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2276-2287 Template-Type: ReDIF-Article 1.0 Author-Name: Jian-Feng Cai Author-X-Name-First: Jian-Feng Author-X-Name-Last: Cai Author-Name: Jingyang Li Author-X-Name-First: Jingyang Author-X-Name-Last: Li Author-Name: Dong Xia Author-X-Name-First: Dong Author-X-Name-Last: Xia Title: Generalized Low-Rank Plus Sparse Tensor Estimation by Fast Riemannian Optimization Abstract: We investigate a generalized framework to estimate a latent low-rank plus sparse tensor, where the low-rank tensor often captures the multi-way principal components and the sparse tensor accounts for potential model misspecifications or heterogeneous signals that are unexplainable by the low-rank part. The framework flexibly covers both linear and generalized linear models, and can easily handle continuous or categorical variables. We propose a fast algorithm by integrating the Riemannian gradient descent and a novel gradient pruning procedure. Under suitable conditions, the algorithm converges linearly and can simultaneously estimate both the low-rank and sparse tensors. The statistical error bounds of the final estimates are established in terms of the gradient of the loss function. The error bounds are generally sharp under specific statistical models, for example, the sub-Gaussian robust PCA and the Bernoulli tensor model. Moreover, our method achieves nontrivial error bounds for heavy-tailed tensor PCA whenever the noise has a finite (2+ε)th moment. We apply our method to analyze the international trade flow dataset and the statistician hypergraph coauthorship network, both yielding new and interesting findings. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2588-2604 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2063131 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2063131 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2588-2604 Template-Type: ReDIF-Article 1.0 Author-Name: Pavel N. Krivitsky Author-X-Name-First: Pavel N. 
Author-X-Name-Last: Krivitsky Author-Name: Pietro Coletti Author-X-Name-First: Pietro Author-X-Name-Last: Coletti Author-Name: Niel Hens Author-X-Name-First: Niel Author-X-Name-Last: Hens Title: Rejoinder to Discussion of “A Tale of Two Datasets: Representativeness and Generalisability of Inference for Samples of Networks” Journal: Journal of the American Statistical Association Pages: 2235-2238 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2023.2280383 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2280383 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2235-2238 Template-Type: ReDIF-Article 1.0 Author-Name: Noirrit Kiran Chandra Author-X-Name-First: Noirrit Kiran Author-X-Name-Last: Chandra Author-Name: Abhra Sarkar Author-X-Name-First: Abhra Author-X-Name-Last: Sarkar Author-Name: John F. de Groot Author-X-Name-First: John F. Author-X-Name-Last: de Groot Author-Name: Ying Yuan Author-X-Name-First: Ying Author-X-Name-Last: Yuan Author-Name: Peter Müller Author-X-Name-First: Peter Author-X-Name-Last: Müller Title: Bayesian Nonparametric Common Atoms Regression for Generating Synthetic Controls in Clinical Trials Abstract: The availability of electronic health records (EHR) has opened opportunities to supplement increasingly expensive and difficult-to-carry-out randomized controlled trials (RCTs) with evidence from readily available real-world data. In this article, we use EHR data to construct synthetic control arms for treatment-only single-arm trials. We propose a novel nonparametric Bayesian common atoms mixture model that allows us to find equivalent population strata in the EHR and the treatment arm and then resample the EHR data to create equivalent patient populations under both the single-arm trial and the resampled EHR. Resampling is implemented via a density-free importance sampling scheme. Using the synthetic control arm, inference for the treatment effect can then be carried out using any method available for RCTs. Alternatively, the proposed nonparametric Bayesian model allows straightforward model-based inference. In simulation experiments, the proposed method exhibits higher power than alternative methods in detecting treatment effects, specifically for nonlinear response functions. We apply the method to supplement single-arm treatment-only glioblastoma studies with a synthetic control arm based on historical trials. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2301-2314 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2023.2231581 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2231581 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2301-2314 Template-Type: ReDIF-Article 1.0 Author-Name: Yu Zhou Author-X-Name-First: Yu Author-X-Name-Last: Zhou Author-Name: Lan Wang Author-X-Name-First: Lan Author-X-Name-Last: Wang Author-Name: Rui Song Author-X-Name-First: Rui Author-X-Name-Last: Song Author-Name: Tuoyi Zhao Author-X-Name-First: Tuoyi Author-X-Name-Last: Zhao Title: Transformation-Invariant Learning of Optimal Individualized Decision Rules with Time-to-Event Outcomes Abstract: In many important applications of precision medicine, the outcome of interest is time to an event (e.g., death, relapse of disease) and the primary goal is to identify the optimal individualized decision rule (IDR) to prolong survival time. Existing work in this area has mostly focused on estimating the optimal IDR to maximize the restricted mean survival time in the population. We propose a new robust framework for estimating an optimal static or dynamic IDR with time-to-event outcomes based on an easy-to-interpret quantile criterion. The new method does not need to specify an outcome regression model and is robust to heavy-tailed distributions. The estimation problem corresponds to a nonregular M-estimation problem with both finite- and infinite-dimensional nuisance parameters. Employing advanced empirical process techniques, we establish the statistical theory of the estimated parameter indexing the optimal IDR. Furthermore, we prove a novel result that the proposed approach can consistently estimate the optimal value function under mild conditions even when the optimal IDR is nonunique, which happens in the challenging setting of exceptional laws. We also propose a smoothed resampling procedure for inference. The proposed methods are implemented in the R package QTOCen. We demonstrate the performance of the proposed new methods via extensive Monte Carlo studies and a real data application. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2632-2644 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2068420 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2068420 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2632-2644 Template-Type: ReDIF-Article 1.0 Author-Name: Mats J. Stensrud Author-X-Name-First: Mats J. Author-X-Name-Last: Stensrud Author-Name: James M. Robins Author-X-Name-First: James M. Author-X-Name-Last: Robins Author-Name: Aaron Sarvet Author-X-Name-First: Aaron Author-X-Name-Last: Sarvet Author-Name: Eric J. Tchetgen Tchetgen Author-X-Name-First: Eric J. Author-X-Name-Last: Tchetgen Tchetgen Author-Name: Jessica G. Young Author-X-Name-First: Jessica G. Author-X-Name-Last: Young Title: Conditional Separable Effects Abstract: Researchers are often interested in treatment effects on outcomes that are only defined conditional on posttreatment events. For example, in a study of the effect of different cancer treatments on quality of life at end of follow-up, the quality of life of individuals who die during the study is undefined. 
In these settings, naive contrasts of outcomes conditional on posttreatment events are not average causal effects, even in randomized experiments. Therefore, the effect in the principal stratum of those who would have the same value of the posttreatment variable regardless of treatment (such as the survivor average causal effect) is often advocated for causal inference. While principal stratum effects are average causal effects, they refer to a subset of the population that cannot be observed and may not exist. Therefore, it is not clear how these effects inform decisions or policies. Here we propose the conditional separable effects, quantifying causal effects of modified versions of the study treatment in an observable subset of the population. These effects, which may quantify direct effects of the study treatment, require transparent reasoning about candidate modified treatments and their mechanisms. We provide identifying conditions and various estimators of these effects along with an applied example. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2671-2683 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2071276 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2071276 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2671-2683 Template-Type: ReDIF-Article 1.0 Author-Name: Hang Deng Author-X-Name-First: Hang Author-X-Name-Last: Deng Author-Name: Qiyang Han Author-X-Name-First: Qiyang Author-X-Name-Last: Han Author-Name: Bodhisattva Sen Author-X-Name-First: Bodhisattva Author-X-Name-Last: Sen Title: Inference for Local Parameters in Convexity Constrained Models Abstract: In this article, we develop automated inference methods for “local” parameters in a collection of convexity constrained models based on the natural constrained tuning-free estimators. A canonical example is given by the univariate convex regression model, in which automated inference is drawn for the function value, the function derivative at a fixed interior point, and the anti-mode of the convex regression function, based on the widely used tuning-free, piecewise linear convex least squares estimator (LSE). The key to our inference proposal in this model is a pivotal joint limit distribution theory for the LS estimates of the local parameters, normalized appropriately by the length of a certain data-driven linear piece of the convex LSE. Such a pivotal limiting distribution instantly gives rise to confidence intervals for these local parameters, whose construction requires almost no more effort than computing the convex LSE itself. This inference method in the convex regression model is a special case of a general inference machinery that covers a number of convexity constrained models in which a limit distribution theory is available for model-specific estimators. Concrete models include: (i) log-concave density estimation, (ii) s-concave density estimation, (iii) convex nonincreasing density estimation, (iv) concave bathtub-shaped hazard function estimation, and (v) concave distribution function estimation from corrupted data. 
The proposed confidence intervals for all these models are proved to have asymptotically exact coverage and oracle length, and require no information beyond the estimator itself. We provide extensive simulation evidence that validates our theoretical results. Real data applications and comparisons with competing methods are given to illustrate the usefulness of our inference proposals. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2721-2735 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2071721 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2071721 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2721-2735 Template-Type: ReDIF-Article 1.0 Author-Name: Wenbo Wang Author-X-Name-First: Wenbo Author-X-Name-Last: Wang Author-Name: Xingye Qiao Author-X-Name-First: Xingye Author-X-Name-Last: Qiao Title: Set-Valued Support Vector Machine with Bounded Error Rates Abstract: This article concerns cautious classification models that are allowed to predict a set of class labels or decline to make a prediction when the uncertainty in the prediction is high. This set-valued classification approach is equivalent to the task of acceptance region learning, which aims to identify subsets of the input space, each of which guarantees to cover observations in a class with at least a predetermined probability. We propose to directly learn the acceptance regions through risk minimization, by making use of a truncated hinge loss and a constrained optimization framework. Collectively our theoretical analyses show that these acceptance regions, with high probability, satisfy simultaneously two properties: (a) they guarantee to cover each class with a noncoverage rate bounded from above; (b) they give the least ambiguous predictions among all the acceptance regions satisfying (a). An efficient algorithm is developed and numerical studies are conducted using both simulated and real data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2847-2859 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2089573 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2089573 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2847-2859 Template-Type: ReDIF-Article 1.0 Author-Name: Akihiko Nishimura Author-X-Name-First: Akihiko Author-X-Name-Last: Nishimura Author-Name: Marc A. Suchard Author-X-Name-First: Marc A. Author-X-Name-Last: Suchard Title: Prior-Preconditioned Conjugate Gradient Method for Accelerated Gibbs Sampling in “Large n, Large p” Bayesian Sparse Regression Abstract: In a modern observational study based on healthcare databases, the number of observations and of predictors typically range in the order of 10^5–10^6 and of 10^4–10^5. Despite the large sample size, data rarely provide sufficient information to reliably estimate such a large number of parameters. 
Sparse regression techniques provide potential solutions, one notable approach being the Bayesian method based on shrinkage priors. In the “large n and large p” setting, however, the required posterior computation encounters a bottleneck at repeated sampling from a high-dimensional Gaussian distribution, whose precision matrix Φ is expensive to compute and factorize. In this article, we present a novel algorithm to speed up this bottleneck based on the following observation: We can cheaply generate a random vector b such that the solution to the linear system Φβ=b has the desired Gaussian distribution. We can then solve the linear system by the conjugate gradient (CG) algorithm through matrix-vector multiplications by Φ; this involves no explicit factorization or calculation of Φ itself. Rapid convergence of CG in this context is guaranteed by the theory of prior-preconditioning we develop. We apply our algorithm to a clinically relevant large-scale observational study with n=72,489 patients and p=22,175 clinical covariates, designed to assess the relative risk of adverse events from two alternative blood anti-coagulants. Our algorithm demonstrates an order of magnitude speed-up in posterior inference, in our case cutting the computation time from two weeks to less than a day. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2468-2481 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2057859 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2057859 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2468-2481 Template-Type: ReDIF-Article 1.0 Author-Name: Pratik Ramprasad Author-X-Name-First: Pratik Author-X-Name-Last: Ramprasad Author-Name: Yuantong Li Author-X-Name-First: Yuantong Author-X-Name-Last: Li Author-Name: Zhuoran Yang Author-X-Name-First: Zhuoran Author-X-Name-Last: Yang Author-Name: Zhaoran Wang Author-X-Name-First: Zhaoran Author-X-Name-Last: Wang Author-Name: Will Wei Sun Author-X-Name-First: Will Wei Author-X-Name-Last: Sun Author-Name: Guang Cheng Author-X-Name-First: Guang Author-X-Name-Last: Cheng Title: Online Bootstrap Inference For Policy Evaluation In Reinforcement Learning Abstract: The recent emergence of reinforcement learning (RL) has created a demand for robust statistical inference methods for the parameter estimates computed using these algorithms. Existing methods for inference in online learning are restricted to settings involving independently sampled observations, while inference methods in RL have so far been limited to the batch setting. The bootstrap is a flexible and efficient approach for statistical inference in online learning algorithms, but its efficacy in settings involving Markov noise, such as RL, has yet to be explored. In this article, we study the use of the online bootstrap method for inference in RL policy evaluation. In particular, we focus on the temporal difference (TD) learning and Gradient TD (GTD) learning algorithms, which are themselves special instances of linear stochastic approximation under Markov noise. 
The method is shown to be distributionally consistent for statistical inference in policy evaluation, and numerical experiments are included to demonstrate the effectiveness of this algorithm across a range of real RL environments. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2901-2914 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2096620 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2096620 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2901-2914 Template-Type: ReDIF-Article 1.0 Author-Name: Hilda S. Ibriga Author-X-Name-First: Hilda S. Author-X-Name-Last: Ibriga Author-Name: Will Wei Sun Author-X-Name-First: Will Wei Author-X-Name-Last: Sun Title: Covariate-Assisted Sparse Tensor Completion Abstract: We aim to provably complete a sparse and highly missing tensor in the presence of covariate information along tensor modes. Our motivation comes from online advertising, where users’ click-through rates (CTR) on ads over various devices form a CTR tensor that has about 96% missing entries and many zeros on nonmissing entries, which makes the standalone tensor completion method unsatisfactory. Besides the CTR tensor, additional ad features or user characteristics are often available. In this article, we propose Covariate-assisted Sparse Tensor Completion (COSTCO) to incorporate covariate information for the recovery of the sparse tensor. The key idea is to jointly extract latent components from both the tensor and the covariate matrix to learn a synthetic representation. Theoretically, we derive the error bound for the recovered tensor components and explicitly quantify the improvements on both the reveal probability condition and the tensor recovery accuracy due to covariates. Finally, we apply COSTCO to an advertisement dataset consisting of a CTR tensor and an ad covariate matrix, leading to a 23% accuracy improvement over the baseline. An important by-product is that ad latent components from COSTCO reveal interesting ad clusters, which are useful for better ad targeting. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2605-2619 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2066537 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2066537 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2605-2619 Template-Type: ReDIF-Article 1.0 Author-Name: Pavel N. Krivitsky Author-X-Name-First: Pavel N. Author-X-Name-Last: Krivitsky Author-Name: Pietro Coletti Author-X-Name-First: Pietro Author-X-Name-Last: Coletti Author-Name: Niel Hens Author-X-Name-First: Niel Author-X-Name-Last: Hens Title: A Tale of Two Datasets: Representativeness and Generalisability of Inference for Samples of Networks Abstract: The last two decades have seen considerable progress in foundational aspects of statistical network analysis, but the path from theory to application is not straightforward. 
Two large, heterogeneous samples of small networks of within-household contacts in Belgium were collected using two different but complementary sampling designs: one smaller but with all contacts in each household observed, the other larger and more representative but recording contacts of only one person per household. We wish to combine their strengths to learn the social forces that shape household contact formation and facilitate simulation for prediction of disease spread, while generalising to the population of households in the region. To accomplish this, we describe a flexible framework for specifying multi-network models in the exponential family class and identify the requirements for inference and prediction under this framework to be consistent, identifiable, and generalisable, even when data are incomplete; explore how these requirements may be violated in practice; and develop a suite of quantitative and graphical diagnostics for detecting violations and suggesting improvements to candidate models. We report on the effects of network size, geography, and household roles on household contact patterns (activity, heterogeneity in activity, and triadic closure). Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2213-2224 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2023.2242627 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2242627 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2213-2224 Template-Type: ReDIF-Article 1.0 Author-Name: Mary Lai O. Salvaña Author-X-Name-First: Mary Lai O. Author-X-Name-Last: Salvaña Author-Name: Amanda Lenzi Author-X-Name-First: Amanda Author-X-Name-Last: Lenzi Author-Name: Marc G. Genton Author-X-Name-First: Marc G. Author-X-Name-Last: Genton Title: Spatio-Temporal Cross-Covariance Functions under the Lagrangian Framework with Multiple Advections Abstract: When analyzing the spatio-temporal dependence in most environmental and earth sciences variables, such as pollutant concentrations at different levels of the atmosphere, a special property is observed: the covariances and cross-covariances are stronger in certain directions. This property is attributed to the presence of natural forces, such as wind, which cause the transport and dispersion of these variables. These spatio-temporal dynamics prompted the use of the Lagrangian reference frame alongside any Gaussian spatio-temporal geostatistical model. Under this modeling framework, a whole new class was born, known as the class of spatio-temporal covariance functions under the Lagrangian framework, with several developments already established in the univariate setting, in both stationary and nonstationary formulations, but less so in the multivariate case. Despite the many advances in this modeling approach, efforts have yet to be directed to the case of multiple advections, especially when several variables are involved. Accounting for multiple advections would make the Lagrangian framework a more viable approach for modeling realistic multivariate transport scenarios. 
In this work, we establish a class of Lagrangian spatio-temporal cross-covariance functions with multiple advections, study its properties, and demonstrate its use on a bivariate pollutant dataset of particulate matter in Saudi Arabia. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2746-2761 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2078330 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2078330 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2746-2761 Template-Type: ReDIF-Article 1.0 Author-Name: Zhaoxue Tong Author-X-Name-First: Zhaoxue Author-X-Name-Last: Tong Author-Name: Zhanrui Cai Author-X-Name-First: Zhanrui Author-X-Name-Last: Cai Author-Name: Songshan Yang Author-X-Name-First: Songshan Author-X-Name-Last: Yang Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Title: Model-Free Conditional Feature Screening with FDR Control Abstract: In this article, we propose a model-free conditional feature screening method with false discovery rate (FDR) control for ultra-high dimensional data. The proposed method is built upon a new measure of conditional independence. Thus, the new method does not require a specific functional form of the regression function and is robust to heavy-tailed responses and predictors. The variables to be conditioned on are allowed to be multivariate. The proposed method enjoys sure screening and ranking consistency properties under mild regularity conditions. To control the FDR, we apply the Reflection via Data Splitting method and prove its theoretical guarantee using martingale theory and empirical process techniques. Simulated examples and real data analysis show that the proposed method performs very well compared with existing methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2575-2587 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2063130 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2063130 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2575-2587 Template-Type: ReDIF-Article 1.0 Author-Name: Augustin Chevallier Author-X-Name-First: Augustin Author-X-Name-Last: Chevallier Author-Name: Paul Fearnhead Author-X-Name-First: Paul Author-X-Name-Last: Fearnhead Author-Name: Matthew Sutton Author-X-Name-First: Matthew Author-X-Name-Last: Sutton Title: Reversible Jump PDMP Samplers for Variable Selection Abstract: A new class of Markov chain Monte Carlo (MCMC) algorithms, based on simulating piecewise deterministic Markov processes (PDMPs), has recently shown great promise: they are nonreversible, can mix better than standard MCMC algorithms, and can use subsampling ideas to speed up computation in big data scenarios. However, current PDMP samplers can only sample from posterior densities that are differentiable almost everywhere, which precludes their use for model choice. 
Motivated by variable selection problems, we show how to develop reversible jump PDMP samplers that can jointly explore the discrete space of models and the continuous space of parameters. Our framework is general: it takes any existing PDMP sampler, and adds two types of trans-dimensional moves that allow for the addition or removal of a variable from the model. We show how the rates of these trans-dimensional moves can be calculated so that the sampler has the correct invariant distribution. We remove a variable from a model when the associated parameter is zero, and this means that the rates of the trans-dimensional moves do not depend on the likelihood. It is thus easy to implement a reversible jump version of any PDMP sampler that can explore a fixed model. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2915-2927 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2099402 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2099402 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2915-2927 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2081575_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Snigdha Panigrahi Author-X-Name-First: Snigdha Author-X-Name-Last: Panigrahi Author-Name: Jonathan Taylor Author-X-Name-First: Jonathan Author-X-Name-Last: Taylor Title: Approximate Selective Inference via Maximum Likelihood Abstract: Several strategies have been developed recently to ensure valid inference after model selection; some of these are easy to compute, while others fare better in terms of inferential power. In this article, we consider a selective inference framework for Gaussian data. We propose a new method for inference through approximate maximum likelihood estimation. Our goals are to: (a) achieve better inferential power with the aid of randomization, and (b) bypass expensive MCMC sampling from exact conditional distributions that are hard to evaluate in closed form. We construct approximate inference, such as p-values and confidence intervals, by solving a fairly simple, convex optimization problem. We illustrate the potential of our method across wide-ranging values of the signal-to-noise ratio in simulations. On a cancer gene expression dataset, we find that our method improves upon the inferential power of some commonly used strategies for selective inference. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2810-2820 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2081575 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2081575 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2810-2820 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2225742_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Tianchen Xu Author-X-Name-First: Tianchen Author-X-Name-Last: Xu Author-Name: Yuan Chen Author-X-Name-First: Yuan Author-X-Name-Last: Chen Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: Yuanjia Wang Author-X-Name-First: Yuanjia Author-X-Name-Last: Wang Title: Mixed-Response State-Space Model for Analyzing Multi-Dimensional Digital Phenotypes Abstract: Digital technologies (e.g., mobile phones) can be used to obtain objective, frequent, and real-world digital phenotypes from individuals. However, modeling these data poses substantial challenges since observational data are subject to confounding and various sources of variability. For example, signals on patients’ underlying health status and treatment effects are mixed with variation due to the living environment and measurement noise. The digital phenotype data thus show extensive variability between and within patients, as well as across different health domains (e.g., motor, cognitive, and speaking). Motivated by a mobile health study of Parkinson’s disease (PD), we develop a mixed-response state-space (MRSS) model to jointly capture multi-dimensional, multi-modal digital phenotypes and their measurement processes by a finite number of latent state time series. These latent states reflect the dynamic health status and personalized time-varying treatment effects and can be used to adjust for informative measurements. For computation, we use the Kalman filter for Gaussian phenotypes and importance sampling with Laplace approximation for non-Gaussian phenotypes. We conduct comprehensive simulation studies and demonstrate the advantage of MRSS in modeling a mobile health study that remotely collects real-time digital phenotypes from PD patients. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2288-2300 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2023.2225742 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2225742 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2288-2300 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2071278_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Ye Tian Author-X-Name-First: Ye Author-X-Name-Last: Tian Author-Name: Yang Feng Author-X-Name-First: Yang Author-X-Name-Last: Feng Title: Transfer Learning Under High-Dimensional Generalized Linear Models Abstract: In this work, we study the transfer learning problem under high-dimensional generalized linear models (GLMs), which aim to improve the fit on target data by borrowing information from useful source data. Given which sources to transfer, we propose a transfer learning algorithm on GLM, and derive its l1/l2-estimation error bounds as well as a bound for a prediction error measure. The theoretical analysis shows that when the target and sources are sufficiently close to each other, these bounds could be improved over those of the classical penalized estimator using only target data under mild conditions.
When we do not know which sources to transfer, an algorithm-free transferable source detection approach is introduced to detect informative sources. The detection consistency is proved under the high-dimensional GLM transfer learning setting. We also propose an algorithm to construct confidence intervals for each coefficient component, and the corresponding theories are provided. Extensive simulations and a real-data experiment verify the effectiveness of our algorithms. We implement the proposed GLM transfer learning algorithms in a new R package glmtrans, which is available on CRAN. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2684-2697 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2071278 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2071278 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2684-2697 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2053137_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Ruijian Han Author-X-Name-First: Ruijian Author-X-Name-Last: Han Author-Name: Yiming Xu Author-X-Name-First: Yiming Author-X-Name-Last: Xu Author-Name: Kani Chen Author-X-Name-First: Kani Author-X-Name-Last: Chen Title: A General Pairwise Comparison Model for Extremely Sparse Networks Abstract: Statistical estimation using pairwise comparison data is an effective approach to analyzing large-scale sparse networks. In this article, we propose a general framework to model the mutual interactions in a network, which enjoys ample flexibility in terms of model parameterization. Under this setup, we show that the maximum likelihood estimator for the latent score vector of the subjects is uniformly consistent under a near-minimal condition on network sparsity. This condition is sharp in terms of the leading order asymptotics describing the sparsity. Our analysis uses a novel chaining technique and illustrates an important connection between graph topology and model consistency. Our results guarantee that the maximum likelihood estimator is justified for estimation in large-scale pairwise comparison networks where data are asymptotically deficient. Simulation studies are provided in support of our theoretical findings. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2422-2432 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2053137 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2053137 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2422-2432 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2050243_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Kean Ming Tan Author-X-Name-First: Kean Ming Author-X-Name-Last: Tan Author-Name: Qiang Sun Author-X-Name-First: Qiang Author-X-Name-Last: Sun Author-Name: Daniela Witten Author-X-Name-First: Daniela Author-X-Name-Last: Witten Title: Sparse Reduced Rank Huber Regression in High Dimensions Abstract: We propose a sparse reduced rank Huber regression for analyzing large and complex high-dimensional data with heavy-tailed random noise.
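A minimal sketch of the two-step transfer learning idea described in the Tian and Feng abstract above (implemented by their R package glmtrans). This Python stand-in covers only the Gaussian case with a single source already known to be informative; the function name, tuning defaults, and use of scikit-learn's lasso are our choices, not the package's algorithm verbatim.

```python
import numpy as np
from sklearn.linear_model import Lasso

def translasso_gaussian(X_src, y_src, X_tgt, y_tgt, lam1=0.1, lam2=0.1):
    # Step 1: pool source and target data to estimate a shared coefficient.
    X_pool = np.vstack([X_src, X_tgt])
    y_pool = np.concatenate([y_src, y_tgt])
    w = Lasso(alpha=lam1).fit(X_pool, y_pool).coef_
    # Step 2: correct the bias on the target by fitting the residual;
    # the correction is sparse when source and target are close, which
    # is the regime where the error bounds in the abstract improve.
    delta = Lasso(alpha=lam2).fit(X_tgt, y_tgt - X_tgt @ w).coef_
    return w + delta
```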
The proposed method is based on a convex relaxation of a rank- and sparsity-constrained nonconvex optimization problem, which is then solved using a block coordinate descent and an alternating direction method of multipliers algorithm. We establish nonasymptotic estimation error bounds under both Frobenius and nuclear norms in the high-dimensional setting. This is a major contribution over existing results in reduced rank regression, which mainly focus on rank selection and prediction consistency. Our theoretical results quantify the tradeoff between heavy-tailedness of the random noise and statistical bias. For random noise with bounded (1+δ)th moment with δ∈(0,1), the rate of convergence is a function of δ, and is slower than the sub-Gaussian-type deviation bounds; for random noise with bounded second moment, we obtain a rate of convergence as if sub-Gaussian noise were assumed. We illustrate the performance of the proposed method via extensive numerical studies and a data application. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2383-2393 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2050243 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2050243 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2383-2393 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2093206_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Zhu Li Author-X-Name-First: Zhu Author-X-Name-Last: Li Author-Name: Weijie J. Su Author-X-Name-First: Weijie J. Author-X-Name-Last: Su Author-Name: Dino Sejdinovic Author-X-Name-First: Dino Author-X-Name-Last: Sejdinovic Title: Benign Overfitting and Noisy Features Abstract: Modern machine learning models often exhibit the benign overfitting phenomenon, which has recently been characterized using double descent curves. In addition to the classical U-shaped learning curve, the learning risk undergoes another descent as we increase the number of parameters beyond a certain threshold. In this article, we examine the conditions under which benign overfitting occurs in random feature (RF) models, that is, in a two-layer neural network with fixed first layer weights. Adopting a novel view of random features, we show that benign overfitting emerges because of the noise residing in such features. The noise may already exist in the data and propagates to the features, or it may be added by the user to the features directly. Such noise plays an implicit yet crucial regularization role in the phenomenon. In addition, we derive the explicit tradeoff between the number of parameters and the prediction accuracy, and for the first time demonstrate that an overparameterized model can achieve the optimal learning rate in the minimax sense. Finally, our results indicate that the learning risk for overparameterized models exhibits multiple-descent, instead of double-descent, behavior, which has been empirically verified in recent works. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2876-2888 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2093206 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2093206 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2876-2888 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2257260_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Jaewoo Park Author-X-Name-First: Jaewoo Author-X-Name-Last: Park Title: Bayesian Filtering and Smoothing, 2nd ed. Journal: Journal of the American Statistical Association Pages: 2943-2945 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2023.2257260 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2257260 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2943-2945 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2087660_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Marinho Bertanha Author-X-Name-First: Marinho Author-X-Name-Last: Bertanha Author-Name: Eunyi Chung Author-X-Name-First: Eunyi Author-X-Name-Last: Chung Title: Permutation Tests at Nonparametric Rates Abstract: Classical two-sample permutation tests for equality of distributions have exact size in finite samples, but they fail to control size for testing equality of parameters that summarize each distribution. This article proposes permutation tests for equality of parameters that are estimated at root-n or slower rates. Our general framework applies to both parametric and nonparametric models, with two samples or one sample split into two subsamples. Our tests have correct size asymptotically while preserving exact size in finite samples when distributions are equal. They have no loss in local asymptotic power compared to tests that use asymptotic critical values. We propose confidence sets with correct coverage in large samples that also have exact coverage in finite samples if distributions are equal up to a transformation. We apply our theory to four commonly-used hypothesis tests of nonparametric functions evaluated at a point. Lastly, simulations show good finite sample properties, and two empirical examples illustrate our tests in practice. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2833-2846 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2087660 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2087660 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2833-2846 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2061982_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Tomas Masak Author-X-Name-First: Tomas Author-X-Name-Last: Masak Author-Name: Victor M. Panaretos Author-X-Name-First: Victor M. Author-X-Name-Last: Panaretos Title: Random Surface Covariance Estimation by Shifted Partial Tracing Abstract: The problem of covariance estimation for replicated surface-valued processes is examined from the functional data analysis perspective. Considerations of statistical and computational efficiency often compel the use of separability of the covariance, even though the assumption may fail in practice. We consider a setting where the covariance structure may fail to be separable locally—either due to noise contamination or due to the presence of a nonseparable short-range dependent signal component. 
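A minimal sketch of a two-sample permutation test with a studentized statistic, the flavor of test whose size and power properties the Bertanha and Chung abstract above studies. Their theory covers general parameters estimated at root-n or slower rates; this toy covers only the sample mean, and the function name and defaults are ours.

```python
import numpy as np

def perm_test_means(x, y, n_perm=9999, seed=None):
    """Permutation p-value for equality of means, studentized statistic."""
    rng = np.random.default_rng(seed)
    def tstat(a, b):
        return (a.mean() - b.mean()) / np.sqrt(
            a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    obs = tstat(x, y)
    z = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(z)                       # relabel the pooled sample
        count += abs(tstat(z[: len(x)], z[len(x):])) >= abs(obs)
    return (1 + count) / (1 + n_perm)        # permutation p-value
```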
That is, the covariance is an additive perturbation of a separable component by a nonseparable but banded component. We introduce nonparametric estimators hinging on the novel concept of shifted partial tracing, enabling computationally efficient estimation of the model under dense observation. Due to the denoising properties of shifted partial tracing, our methods are shown to yield consistent estimators even under noisy discrete observation, without the need for smoothing. Further to deriving the convergence rates and limit theorems, we also show that the implementation of our estimators, including prediction, comes at no computational overhead relative to a separable model. Finally, we demonstrate empirical performance and computational feasibility of our methods in an extensive simulation study and on a real dataset. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2562-2574 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2061982 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2061982 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2562-2574 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2044825_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Qing Mai Author-X-Name-First: Qing Author-X-Name-Last: Mai Author-Name: Di He Author-X-Name-First: Di Author-X-Name-Last: He Author-Name: Hui Zou Author-X-Name-First: Hui Author-X-Name-Last: Zou Title: Coordinatewise Gaussianization: Theories and Applications Abstract: In statistical analysis, researchers often perform coordinatewise Gaussianization such that each variable is marginally normal. The normal score transformation is a method for coordinatewise Gaussianization and is widely used in statistics, econometrics, genetics and other areas. However, few studies exist on the theoretical properties of the normal score transformation, especially in high-dimensional problems where the dimension p diverges with the sample size n. In this article, we show that the normal score transformation uniformly converges to its population counterpart even when log p = o(n/log n). Our result can justify the normal score transformation prior to any downstream statistical method to which the theoretical normal transformation is beneficial. The same results are established for the Winsorized normal transformation, another popular choice for coordinatewise Gaussianization. We demonstrate the benefits of coordinatewise Gaussianization by studying its applications to the Gaussian copula model, the nearest shrunken centroids classifier and distance correlation. The benefits are clearly shown in theory and supported by numerical studies. Moreover, we also point out scenarios where coordinatewise Gaussianization does not help and can even cause damage. We offer a general recommendation on how to use coordinatewise Gaussianization in applications. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2329-2343 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2044825 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2044825 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
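A minimal sketch of the normal score transformation studied in the Mai, He, and Zou abstract above: each coordinate is replaced by normal quantiles of its (slightly shrunk) ranks. The n/(n+1) rank shrinkage is one common convention, not necessarily the article's exact definition.

```python
import numpy as np
from scipy.stats import norm, rankdata

def normal_score(X):
    """Coordinatewise Gaussianization of an n x p data matrix."""
    n = X.shape[0]
    ranks = np.apply_along_axis(rankdata, 0, X)  # columnwise ranks 1..n
    return norm.ppf(ranks / (n + 1))             # each column ~ marginally normal
```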
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2329-2343 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2208390_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Georgia Papadogeorgou Author-X-Name-First: Georgia Author-X-Name-Last: Papadogeorgou Author-Name: Carolina Bello Author-X-Name-First: Carolina Author-X-Name-Last: Bello Author-Name: Otso Ovaskainen Author-X-Name-First: Otso Author-X-Name-Last: Ovaskainen Author-Name: David B. Dunson Author-X-Name-First: David B. Author-X-Name-Last: Dunson Title: Covariate-Informed Latent Interaction Models: Addressing Geographic & Taxonomic Bias in Predicting Bird–Plant Interactions Abstract: Reductions in natural habitats make it urgent that we better understand species’ interconnections and how biological communities respond to environmental changes. However, ecological studies of species’ interactions are limited by their geographic and taxonomic focus, which can distort our understanding of interaction dynamics. We focus on bird–plant interactions that refer to situations of potential fruit consumption and seed dispersal. We develop an approach for predicting species’ interactions that accounts for errors in the recorded interaction networks, addresses the geographic and taxonomic biases of existing studies, is based on latent factors to increase flexibility and borrow information across species, incorporates covariates in a flexible manner to inform the latent factors, and uses a meta-analysis dataset from 85 individual studies. We focus on interactions among 232 birds and 511 plants in the Atlantic Forest, and identify 5% of species pairs that have no recorded interaction but a posterior probability above 80% that an interaction is possible. Finally, we develop a permutation-based variable importance procedure for latent factor network models and identify that a bird’s body mass and a plant’s fruit diameter are important in driving the presence of species interactions, with a multiplicative relationship that exhibits both a thresholding and a matching behavior. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2250-2261 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2023.2208390 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2208390 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2250-2261 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2071279_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Zhaoxing Gao Author-X-Name-First: Zhaoxing Author-X-Name-Last: Gao Author-Name: Ruey S. Tsay Author-X-Name-First: Ruey S. Author-X-Name-Last: Tsay Title: Divide-and-Conquer: A Distributed Hierarchical Factor Approach to Modeling Large-Scale Time Series Data Abstract: This article proposes a hierarchical approximate-factor approach to analyzing high-dimensional, large-scale heterogeneous time series data using distributed computing. The new method employs a multiple-fold dimension reduction procedure using Principal Component Analysis (PCA) and shows great promise for modeling large-scale data that cannot be stored or analyzed by a single machine.
Each computer at the basic level performs a PCA to extract common factors among the time series assigned to it and transfers those factors to one and only one node of the second level. Each second-level computer collects the common factors from its subordinates and performs another PCA to select the second-level common factors. This process is repeated until the central server is reached, which collects factors from its direct subordinates and performs a final PCA to select the global common factors. The noise terms of the second-level approximate factor model are the unique common factors of the first-level clusters. We focus on the case of two levels in our theoretical derivations, but the idea can easily be generalized to any finite number of hierarchies, and the proposed method is also applicable to data with heterogeneous and multilevel subcluster structures that are stored and analyzed by a single machine. We introduce a new diffusion index approach to forecasting based on the global and group-specific factors. Some clustering methods are discussed in the supplement when the group memberships are unknown. We further extend the analysis to unit-root nonstationary time series. Asymptotic properties of the proposed method are derived for the diverging dimension of the data in each computing unit and the sample size T. We use both simulated and real examples to assess the performance of the proposed method in finite samples, and compare our method with the commonly used ones in the literature concerning the forecasting ability of extracted factors. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2698-2711 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2071279 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2071279 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2698-2711 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2051519_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Weijing Tang Author-X-Name-First: Weijing Author-X-Name-Last: Tang Author-Name: Kevin He Author-X-Name-First: Kevin Author-X-Name-Last: He Author-Name: Gongjun Xu Author-X-Name-First: Gongjun Author-X-Name-Last: Xu Author-Name: Ji Zhu Author-X-Name-First: Ji Author-X-Name-Last: Zhu Title: Survival Analysis via Ordinary Differential Equations Abstract: This article introduces an Ordinary Differential Equation (ODE) notion for survival analysis. The ODE notion not only provides a unified modeling framework, but more importantly, also enables the development of a widely applicable, scalable, and easy-to-implement procedure for estimation and inference. Specifically, the ODE modeling framework unifies many existing survival models, such as the proportional hazards model, the linear transformation model, the accelerated failure time model, and the time-varying coefficient model as special cases. The generality of the proposed framework serves as the foundation of a widely applicable estimation procedure. As an illustrative example, we develop a sieve maximum likelihood estimator for a general semiparametric class of ODE models. In comparison to existing estimation methods, the proposed procedure has advantages in terms of computational scalability and numerical stability. 
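A minimal numerical sketch of the two-level divide-and-conquer factor extraction described in the Gao and Tsay abstract above: each first-level node extracts principal-component factors from its own block of series and ships only those factors upward, where a second PCA yields the global factors. The factor counts r1 and r2 are assumed known here, and all names are ours.

```python
import numpy as np

def pca_factors(Y, r):
    """Y: T x N data block; returns the T x r principal-component factors."""
    Y = Y - Y.mean(axis=0)
    u, s, _ = np.linalg.svd(Y, full_matrices=False)
    return u[:, :r] * s[:r]

def hierarchical_factors(blocks, r1=2, r2=1):
    first_level = [pca_factors(Y, r1) for Y in blocks]  # computed per node
    combined = np.hstack(first_level)                   # only factors are sent up
    return pca_factors(combined, r2)                    # global common factors
```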
Moreover, to address unique theoretical challenges induced by the ODE notion, we establish a new general sieve M-theorem for bundled parameters and show that the proposed sieve estimator is consistent and asymptotically normal, and achieves the semiparametric efficiency bound. The finite sample performance of the proposed estimator is examined in simulation studies and a real-world data example. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2406-2421 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2051519 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2051519 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2406-2421 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2060836_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Tetsuya Kaji Author-X-Name-First: Tetsuya Author-X-Name-Last: Kaji Author-Name: Veronika Ročková Author-X-Name-First: Veronika Author-X-Name-Last: Ročková Title: Metropolis–Hastings via Classification Abstract: This article develops a Bayesian computational platform at the interface between posterior sampling and optimization in models whose marginal likelihoods are difficult to evaluate. Inspired by contrastive learning and Generative Adversarial Networks (GAN), we reframe the likelihood function estimation problem as a classification problem. Pitting a Generator, who simulates fake data, against a Classifier, who tries to distinguish them from the real data, one obtains likelihood (ratio) estimators which can be plugged into the Metropolis–Hastings algorithm. The resulting Markov chains generate, at a steady state, samples from an approximate posterior whose asymptotic properties we characterize. Drawing upon connections with empirical Bayes and Bayesian misspecification, we quantify the convergence rate in terms of the contraction speed of the actual posterior and the convergence rate of the Classifier. Asymptotic normality results are also provided which justify the inferential potential of our approach. We illustrate the usefulness of our approach on examples that have proved challenging for existing Bayesian likelihood-free approaches. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2533-2547 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2060836 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2060836 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2533-2547 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2223680_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Michael Schweinberger Author-X-Name-First: Michael Author-X-Name-Last: Schweinberger Author-Name: Cornelius Fritz Author-X-Name-First: Cornelius Author-X-Name-Last: Fritz Title: Discussion of “A Tale of Two Datasets: Representativeness and Generalisability of Inference for Samples of Networks” by Pavel N.
Krivitsky, Pietro Coletti, and Niel Hens Journal: Journal of the American Statistical Association Pages: 2225-2227 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2023.2223680 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2223680 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2225-2227 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2057317_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857 Author-Name: Shulei Wang Author-X-Name-First: Shulei Author-X-Name-Last: Wang Title: Self-supervised Metric Learning in Multi-View Data: A Downstream Task Perspective Abstract: Self-supervised metric learning has been a successful approach for learning a distance from an unlabeled dataset. The resulting distance is broadly useful for improving various distance-based downstream tasks, even when no information from downstream tasks is used in the metric learning stage. To gain insights into this approach, we develop a statistical framework to theoretically study how self-supervised metric learning can benefit downstream tasks in the context of multi-view data. Under this framework, we show that the target distance of metric learning satisfies several desired properties for the downstream tasks. On the other hand, our investigation suggests the target distance can be further improved by moderating each direction’s weights. In addition, our analysis precisely characterizes the improvement by self-supervised metric learning on four commonly used downstream tasks: sample identification, two-sample testing, k-means clustering, and k-nearest neighbor classification. When the distance is estimated from an unlabeled dataset, we establish the upper bound on distance estimation’s accuracy and the number of samples sufficient for downstream task improvement. Finally, numerical experiments are presented to support the theoretical results in the article. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 2454-2467 Issue: 544 Volume: 118 Year: 2023 Month: 10 X-DOI: 10.1080/01621459.2022.2057317 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2057317 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2454-2467 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2120400_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Laurens de Haan Author-X-Name-First: Laurens Author-X-Name-Last: de Haan Author-Name: Chen Zhou Author-X-Name-First: Chen Author-X-Name-Last: Zhou Title: Bootstrapping Extreme Value Estimators Abstract: This article develops a bootstrap analogue of the well-known asymptotic expansion of the tail quantile process in extreme value theory. One application of this result is to construct confidence intervals for estimators of the extreme value index such as the Probability Weighted Moment (PWM) estimator. For the peaks-over-threshold method, we show the bootstrap consistency of the confidence intervals. By contrast, the asymptotic expansion of the quantile process of the bootstrapped block maxima does not lead to a similar consistency result for the PWM estimator using the block maxima method. 
For both methods, we show by simulations that the sample variance of bootstrapped estimates can be a good approximation for the asymptotic variance of the original estimator. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 382-393 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2120400 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2120400 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:382-393 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2127360_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Christoph Alexander Weitkamp Author-X-Name-First: Christoph Alexander Author-X-Name-Last: Weitkamp Author-Name: Katharina Proksch Author-X-Name-First: Katharina Author-X-Name-Last: Proksch Author-Name: Carla Tameling Author-X-Name-First: Carla Author-X-Name-Last: Tameling Author-Name: Axel Munk Author-X-Name-First: Axel Author-X-Name-Last: Munk Title: Distribution of Distances based Object Matching: Asymptotic Inference Abstract: In this article, we aim to provide a statistical theory for object matching based on a lower bound of the Gromov-Wasserstein distance related to the distribution of (pairwise) distances of the considered objects. To this end, we model general objects as metric measure spaces. Based on this, we propose a simple and efficiently computable asymptotic statistical test for pose invariant object discrimination. This is based on a (β-trimmed) empirical version of the aforementioned lower bound. We derive the distributional limits of this test statistic for the trimmed and untrimmed case. For this purpose, we introduce a novel U-type process indexed in β and show its weak convergence. The theory developed is investigated in Monte Carlo simulations and applied to structural protein comparisons. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 538-551 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2127360 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2127360 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:538-551 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2102019_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Biao Cai Author-X-Name-First: Biao Author-X-Name-Last: Cai Author-Name: Jingfei Zhang Author-X-Name-First: Jingfei Author-X-Name-Last: Zhang Author-Name: Yongtao Guan Author-X-Name-First: Yongtao Author-X-Name-Last: Guan Title: Latent Network Structure Learning From High-Dimensional Multivariate Point Processes Abstract: Learning the latent network structure from large scale multivariate point process data is an important task in a wide range of scientific and business applications. For instance, we might wish to estimate the neuronal functional connectivity network based on spiking times recorded from a collection of neurons. To characterize the complex processes underlying the observed data, we propose a new and flexible class of nonstationary Hawkes processes that allow both excitatory and inhibitory effects. We estimate the latent network structure using an efficient sparse least squares estimation approach.
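Illustrating the final claim of the de Haan and Zhou abstract above, the following is a minimal sketch of approximating an extreme value index estimator's asymptotic variance by the sample variance of bootstrapped estimates. The Hill estimator stands in for their PWM estimator purely to keep the sketch short; names and defaults are ours.

```python
import numpy as np

def hill(x, k):
    """Hill estimator of the extreme value index from the top k order statistics."""
    xs = np.sort(x)
    return np.mean(np.log(xs[-k:])) - np.log(xs[-k - 1])

def boot_var(x, k, B=999, seed=None):
    """Sample variance of B bootstrap replicates of the Hill estimator."""
    rng = np.random.default_rng(seed)
    est = np.array([hill(rng.choice(x, size=len(x), replace=True), k)
                    for _ in range(B)])
    return est.var(ddof=1)   # approximates the estimator's asymptotic variance
```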
Using a thinning representation, we establish concentration inequalities for the first and second order statistics of the proposed Hawkes process. Such theoretical results enable us to establish the non-asymptotic error bound and the selection consistency of the estimated parameters. Furthermore, we describe a least squares loss based statistic for testing if the background intensity is constant in time. We demonstrate the efficacy of our proposed method through simulation studies and an application to a neuron spike train dataset. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 95-108 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2102019 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2102019 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:95-108 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2279695_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Patrick M. LeBlanc Author-X-Name-First: Patrick M. Author-X-Name-Last: LeBlanc Author-Name: David Banks Author-X-Name-First: David Author-X-Name-Last: Banks Author-Name: Linhui Fu Author-X-Name-First: Linhui Author-X-Name-Last: Fu Author-Name: Mingyan Li Author-X-Name-First: Mingyan Author-X-Name-Last: Li Author-Name: Zhengyu Tang Author-X-Name-First: Zhengyu Author-X-Name-Last: Tang Author-Name: Qiuyi Wu Author-X-Name-First: Qiuyi Author-X-Name-Last: Wu Title: Recommender Systems: A Review Abstract: Recommender systems are the engine of online advertising. Not only do they suggest movies, music, or romantic partners, but they also are used to select which advertisements to show to users. This paper reviews the basics of recommender system methodology and then looks at the emerging arena of active recommender systems. Journal: Journal of the American Statistical Association Pages: 773-785 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2023.2279695 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2279695 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:773-785 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2116331_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Lucy L. Gao Author-X-Name-First: Lucy L. Author-X-Name-Last: Gao Author-Name: Jacob Bien Author-X-Name-First: Jacob Author-X-Name-Last: Bien Author-Name: Daniela Witten Author-X-Name-First: Daniela Author-X-Name-Last: Witten Title: Selective Inference for Hierarchical Clustering Abstract: Classical tests for a difference in means control the Type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, then applying a classical test yields an extremely inflated Type I error rate. Notably, this problem persists even if two separate and independent datasets are used to define the groups and to test for a difference in their means. To address this problem, in this article, we propose a selective inference approach to test for a difference in means between two clusters. Our procedure controls the selective Type I error rate by accounting for the fact that the choice of null hypothesis was made based on the data. 
We describe how to efficiently compute exact p-values for clusters obtained using agglomerative hierarchical clustering with many commonly used linkages. We apply our method to simulated data and to single-cell RNA-sequencing data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 332-342 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2116331 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2116331 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:332-342 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2140052_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Xin Ma Author-X-Name-First: Xin Author-X-Name-Last: Ma Author-Name: Suprateek Kundu Author-X-Name-First: Suprateek Author-X-Name-Last: Kundu Title: Multi-Task Learning with High-Dimensional Noisy Images Abstract: Recent medical imaging studies have given rise to distinct but inter-related datasets corresponding to multiple experimental tasks or longitudinal visits. Standard scalar-on-image regression models that fit each dataset separately are not equipped to leverage information across inter-related images, and existing multi-task learning approaches are compromised by the inability to account for the noise that is often observed in images. We propose a novel joint scalar-on-image regression framework involving wavelet-based image representations with grouped penalties that are designed to pool information across inter-related images for joint learning, and which explicitly accounts for noise in high-dimensional images via a projection-based approach. In the presence of nonconvexity arising due to noisy images, we derive nonasymptotic error bounds under nonconvex as well as convex grouped penalties, even when the number of voxels increases exponentially with sample size. A projected gradient descent algorithm is used for computation, which is shown to approximate the optimal solution via well-defined nonasymptotic optimization error bounds under noisy images. Extensive simulations and application to a motivating longitudinal Alzheimer’s disease study illustrate significantly improved predictive ability and greater power to detect true signals that are simply missed by existing methods without noise correction due to the attenuation-to-null phenomenon. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 650-663 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2140052 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2140052 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:650-663 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2115374_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Mingxue Quan Author-X-Name-First: Mingxue Author-X-Name-Last: Quan Author-Name: Zhenhua Lin Author-X-Name-First: Zhenhua Author-X-Name-Last: Lin Title: Optimal One-Pass Nonparametric Estimation Under Memory Constraint Abstract: For nonparametric regression in the streaming setting, where data constantly flow in and require real-time analysis, a main challenge is that data are cleared from the computer system once processed due to limited computer memory and storage. We tackle the challenge by proposing a novel one-pass estimator based on penalized orthogonal basis expansions and developing a general framework to study the interplay between statistical efficiency and memory consumption of estimators. We show that the proposed estimator is statistically optimal under the memory constraint and has an asymptotically minimal memory footprint among all one-pass estimators of the same estimation quality. Numerical studies demonstrate that the proposed one-pass estimator is nearly as efficient as its nonstreaming counterpart that has access to all historical data. Journal: Journal of the American Statistical Association Pages: 285-296 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2115374 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2115374 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:285-296 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2115918_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Federico Camerlenghi Author-X-Name-First: Federico Author-X-Name-Last: Camerlenghi Author-Name: Stefano Favaro Author-X-Name-First: Stefano Author-X-Name-Last: Favaro Author-Name: Lorenzo Masoero Author-X-Name-First: Lorenzo Author-X-Name-Last: Masoero Author-Name: Tamara Broderick Author-X-Name-First: Tamara Author-X-Name-Last: Broderick Title: Scaled Process Priors for Bayesian Nonparametric Estimation of the Unseen Genetic Variation Abstract: There is a growing interest in the estimation of the number of unseen features, mostly driven by biological applications. A recent work brought out a peculiar property of the popular completely random measures (CRMs) as prior models in Bayesian nonparametric (BNP) inference for the unseen-features problem: for fixed prior parameters, they all lead to a Poisson posterior distribution for the number of unseen features, which depends on the sampling information only through the sample size. CRMs are thus not a flexible prior model for the unseen-features problem and, while the Poisson posterior distribution may be appealing for analytical tractability and ease of interpretability, its independence from the sampling information makes the BNP approach a questionable oversimplification, with posterior inferences being completely determined by the estimation of the unknown prior’s parameters. In this article, we introduce the stable-Beta scaled process (SB-SP) prior, and we show that it allows one to enrich the posterior distribution of the number of unseen features arising under CRM priors, while maintaining its analytical tractability and interpretability.
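A minimal sketch of a one-pass estimator of the kind analyzed in the Quan and Lin abstract above: with K fixed basis functions, only a K x K Gram matrix and a length-K vector are kept in memory, so each observation is processed once and then discarded. The cosine basis and the ridge-type roughness penalty are our illustrative choices, not necessarily the article's.

```python
import numpy as np

class OnePassSmoother:
    """Streaming nonparametric regression keeping only O(K^2) summaries."""
    def __init__(self, K=20):
        self.K, self.n = K, 0
        self.G = np.zeros((K, K))   # Gram matrix of basis evaluations
        self.b = np.zeros(K)        # basis-weighted response sums
    def _phi(self, x):              # orthonormal cosine basis on [0, 1]
        j = np.arange(1, self.K)
        return np.concatenate([[1.0], np.sqrt(2) * np.cos(np.pi * j * x)])
    def update(self, x, y):         # one streaming step; data then discarded
        p = self._phi(x)
        self.G += np.outer(p, p); self.b += y * p; self.n += 1
    def estimate(self, lam=1e-2):   # penalized solve from summaries alone
        pen = np.diag(np.arange(self.K, dtype=float) ** 2)  # roughness weights
        coef = np.linalg.solve(self.G / self.n + lam * pen, self.b / self.n)
        return lambda x: self._phi(x) @ coef
```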
That is, the SB-SP prior leads to a negative Binomial posterior distribution, which depends on the sampling information through the sample size and the number of distinct features, with corresponding estimates being simple, linear in the sampling information and computationally efficient. We apply our BNP approach to synthetic data and to real cancer genomic data, showing that: (i) it outperforms the most popular parametric and nonparametric competitors in terms of estimation accuracy; (ii) it provides improved coverage for the estimation with respect to a BNP approach under CRM priors. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 320-331 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2115918 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2115918 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:320-331 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2123814_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Anqi Zhao Author-X-Name-First: Anqi Author-X-Name-Last: Zhao Author-Name: Peng Ding Author-X-Name-First: Peng Author-X-Name-Last: Ding Title: To Adjust or not to Adjust? Estimating the Average Treatment Effect in Randomized Experiments with Missing Covariates Abstract: Randomized experiments allow for consistent estimation of the average treatment effect based on the difference in mean outcomes without strong modeling assumptions. Appropriate use of pretreatment covariates can further improve the estimation efficiency. Missingness in covariates is nevertheless common in practice, and raises an important question: should we adjust for covariates subject to missingness, and if so, how? The unadjusted difference in means is always unbiased. The complete-covariate analysis adjusts for all completely observed covariates, and is asymptotically more efficient than the difference in means if at least one completely observed covariate is predictive of the outcome. Then what is the additional gain of adjusting for covariates subject to missingness? To reconcile the conflicting recommendations in the literature, we analyze and compare five strategies for handling missing covariates in randomized experiments under the design-based framework, and recommend the missingness-indicator method, as a known but not so popular strategy in the literature, due to its multiple advantages. First, it removes the dependence of the regression-adjusted estimators on the imputed values for the missing covariates. Second, it does not require modeling the missingness mechanism, and yields consistent estimators even when the missingness mechanism is related to the missing covariates and unobservable potential outcomes. Third, it ensures large-sample efficiency over the complete-covariate analysis and the analysis based on only the imputed covariates. Lastly, it is easy to implement via least squares. We also propose modifications to it based on asymptotic and finite sample considerations. Importantly, our theory views randomization as the basis for inference, and does not impose any modeling assumptions on the data-generating process or missingness mechanism. Supplementary materials for this article are available online. 
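A minimal sketch of the missingness-indicator method recommended in the Zhao and Ding abstract above: impute missing covariate entries with zero, append the missingness indicators as extra covariates, and run least squares, so the treatment coefficient does not depend on the imputed values. This is the simplest additive version; the centered-interaction modifications the authors propose are omitted, and the function name is ours.

```python
import numpy as np

def mim_ate(y, z, X):
    """y: outcomes; z: 0/1 treatment; X: covariates with np.nan for missing."""
    R = np.isnan(X).astype(float)           # missingness indicators
    X0 = np.where(np.isnan(X), 0.0, X)      # zero imputation
    W = np.column_stack([np.ones_like(z), z, X0, R])
    coef, *_ = np.linalg.lstsq(W, y, rcond=None)
    return coef[1]                          # coefficient on treatment
```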
Journal: Journal of the American Statistical Association Pages: 450-460 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2123814 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2123814 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:450-460 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2105223_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Naoki Awaya Author-X-Name-First: Naoki Author-X-Name-Last: Awaya Author-Name: Li Ma Author-X-Name-First: Li Author-X-Name-Last: Ma Title: Hidden Markov Pólya Trees for High-Dimensional Distributions Abstract: The Pólya tree (PT) process is a general-purpose Bayesian nonparametric model that has found wide application in a range of inference problems. It has a simple analytic form and the posterior computation boils down to beta-binomial conjugate updates along a partition tree over the sample space. Recent development in PT models shows that performance of these models can be substantially improved by (i) allowing the partition tree to adapt to the structure of the underlying distributions and (ii) incorporating latent state variables that characterize local features of the underlying distributions. However, important limitations of the PT remain, including (i) the sensitivity in the posterior inference with respect to the choice of the partition tree, and (ii) the lack of scalability with respect to dimensionality of the sample space. We consider a modeling strategy for PT models that incorporates a flexible prior on the partition tree along with latent states with Markov dependency. We introduce a hybrid algorithm combining sequential Monte Carlo (SMC) and recursive message passing for posterior sampling that can scale up to 100 dimensions. While our description of the algorithm assumes a single computer environment, it has the potential to be implemented on distributed systems to further enhance the scalability. Moreover, we investigate the large sample properties of the tree structures and latent states under the posterior model. We carry out extensive numerical experiments in density estimation and two-group comparison, which show that flexible partitioning can substantially improve the performance of PT models in both inference tasks. We demonstrate an application to a mass cytometry dataset with 19 dimensions and over 200,000 observations. Supplementary Materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 189-201 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2105223 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2105223 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
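A minimal sketch of the beta-binomial conjugate updates along a partition tree that the Awaya and Ma abstract above identifies as the core of Pólya tree computation. The fixed dyadic tree on [0, 1) and the constant Beta parameter a are simplifications for illustration; their model adds adaptive partitions and Markov latent states on top of this mechanism.

```python
import numpy as np

def polya_tree_density(x_data, x_query, depth=6, a=1.0):
    """Posterior-mean Pólya tree density estimate at x_query, data in [0, 1)."""
    dens, lo, hi = 1.0, 0.0, 1.0
    for _ in range(depth):
        mid = (lo + hi) / 2
        n_left = np.sum((x_data >= lo) & (x_data < mid))
        n_right = np.sum((x_data >= mid) & (x_data < hi))
        # Beta-binomial conjugacy: posterior mean of the left-split probability.
        p_left = (a + n_left) / (2 * a + n_left + n_right)
        if x_query < mid:
            dens *= 2 * p_left; hi = mid
        else:
            dens *= 2 * (1 - p_left); lo = mid
    return dens
```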
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:189-201 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2126363_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Xin Zhang Author-X-Name-First: Xin Author-X-Name-Last: Zhang Author-Name: Jia Liu Author-X-Name-First: Jia Author-X-Name-Last: Liu Author-Name: Zhengyuan Zhu Author-X-Name-First: Zhengyuan Author-X-Name-Last: Zhu Title: Learning Coefficient Heterogeneity over Networks: A Distributed Spanning-Tree-Based Fused-Lasso Regression Abstract: Identifying the latent cluster structure based on model heterogeneity is a fundamental but challenging task that arises in many machine learning applications. In this article, we study the clustered coefficient regression problem in distributed network systems, where the data are locally collected and held by nodes. Our work aims to improve the regression estimation efficiency by aggregating the neighbors’ information while also identifying the cluster membership for nodes. To achieve efficient estimation and clustering, we develop a distributed spanning-tree-based fused-lasso regression (DTFLR) approach. In particular, we propose an adaptive spanning-tree-based fusion penalty for the low-complexity clustered coefficient regression. We show that our proposed estimator satisfies statistical oracle properties. Additionally, to solve the problem parallelly, we design a distributed generalized alternating direction method of multipliers algorithm, which has a simple node-based implementation scheme and enjoys a linear convergence rate. Collectively, our results in this article contribute to the theories of low-complexity clustered coefficient regression and distributed optimization over networks. Thorough numerical experiments and real-world data analysis are conducted to verify our theoretical results, which show that our approach outperforms existing works in terms of estimation accuracy, computation speed, and communication costs. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 485-497 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2126363 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2126363 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:485-497 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2294527_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Shengbin Ye Author-X-Name-First: Shengbin Author-X-Name-Last: Ye Author-Name: Thomas P. Senftle Author-X-Name-First: Thomas P. Author-X-Name-Last: Senftle Author-Name: Meng Li Author-X-Name-First: Meng Author-X-Name-Last: Li Title: Operator-Induced Structural Variable Selection for Identifying Materials Genes Abstract: In the emerging field of materials informatics, a fundamental task is to identify physicochemically meaningful descriptors, or materials genes, which are engineered from primary features and a set of elementary algebraic operators through compositions. Standard practice directly analyzes the high-dimensional candidate predictor space in a linear model; statistical analyses are then substantially hampered by the daunting challenge posed by the astronomically large number of correlated predictors with limited sample size.
We formulate this problem as variable selection with operator-induced structure (OIS) and propose a new method to achieve unconventional dimension reduction by using the geometry embedded in OIS. Although the model remains linear, we iterate nonparametric variable selection for effective dimension reduction. This enables variable selection based on ab initio primary features, leading to a method that is orders of magnitude faster than existing methods, with improved accuracy. To select the nonparametric module, we discuss a desired performance criterion that is uniquely induced by variable selection with OIS; in particular, we propose to employ a Bayesian Additive Regression Trees (BART)-based variable selection method. Numerical studies show the superiority of the proposed method, which continues to exhibit robust performance when the input dimension is out of reach of existing methods. Our analysis of single-atom catalysis identifies physical descriptors that explain the binding energy of metal-support pairs with high explanatory power, leading to interpretable insights that guide the prevention of a notorious problem called sintering and aid catalysis design. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 81-94 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2023.2294527 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2294527 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:81-94 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2142590_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Bingyuan Liu Author-X-Name-First: Bingyuan Author-X-Name-Last: Liu Author-Name: Qi Zhang Author-X-Name-First: Qi Author-X-Name-Last: Zhang Author-Name: Lingzhou Xue Author-X-Name-First: Lingzhou Author-X-Name-Last: Xue Author-Name: Peter X.-K. Song Author-X-Name-First: Peter X.-K. Author-X-Name-Last: Song Author-Name: Jian Kang Author-X-Name-First: Jian Author-X-Name-Last: Kang Title: Robust High-Dimensional Regression with Coefficient Thresholding and Its Application to Imaging Data Analysis Abstract: It is important to develop statistical techniques to analyze high-dimensional data in the presence of both complex dependence and possible heavy tails and outliers in real-world applications such as imaging data analyses. We propose a new robust high-dimensional regression with coefficient thresholding, in which an efficient nonconvex estimation procedure is proposed through a thresholding function and the robust Huber loss. The proposed regularization method accounts for complex dependence structures in predictors and is robust against heavy tails and outliers in outcomes. Theoretically, we rigorously analyze the landscape of the population and empirical risk functions for the proposed method. The fine landscape enables us to establish both statistical consistency and computational convergence under the high-dimensional setting. We also present an extension to incorporate spatial information into the proposed method. Finite-sample properties of the proposed methods are examined by extensive simulation studies.
An application concerns a scalar-on-image regression analysis of the association between a psychiatric disorder, measured by the general factor of psychopathology, and features extracted from task functional MRI data in the Adolescent Brain Cognitive Development (ABCD) study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 715-729 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2142590 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2142590 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:715-729 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2270795_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Andrés F. Barrientos Author-X-Name-First: Andrés F. Author-X-Name-Last: Barrientos Author-Name: Aaron R. Williams Author-X-Name-First: Aaron R. Author-X-Name-Last: Williams Author-Name: Joshua Snoke Author-X-Name-First: Joshua Author-X-Name-Last: Snoke Author-Name: Claire McKay Bowen Author-X-Name-First: Claire McKay Author-X-Name-Last: Bowen Title: A Feasibility Study of Differentially Private Summary Statistics and Regression Analyses with Evaluations on Administrative and Survey Data Abstract: Federal administrative data, such as tax data, are invaluable for research, but because of privacy concerns, access to these data is typically limited to select agencies and a few individuals. An alternative to sharing microlevel data is to allow individuals to query statistics without directly accessing the confidential data. This article studies the feasibility of using differentially private (DP) methods to make certain queries while preserving privacy. We also include new methodological adaptations to existing DP regression methods for using new data types and returning standard error estimates. We define feasibility as the impact of DP methods on analyses for making public policy decisions and the queries’ accuracy according to several utility metrics. We evaluate the methods using Internal Revenue Service data and public-use Current Population Survey data and identify how specific data features might challenge some of these methods. Our findings show that DP methods are feasible for simple, univariate statistics but struggle to produce accurate regression estimates and confidence intervals. To the best of our knowledge, this is the first comprehensive statistical study of DP regression methodology on real, complex datasets, and the findings have significant implications for the direction of a growing research field and public policy. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 52-65 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2023.2270795 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2270795 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
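For intuition about the simple univariate DP queries that the Barrientos et al. feasibility study finds workable, the textbook Laplace mechanism for a clipped mean is sketched below. The clipping bounds, the epsilon value, and the synthetic data are illustrative assumptions of this example, not the article's methods or utility metrics:

    import numpy as np

    def dp_mean(x, lower, upper, epsilon, rng):
        # Clip to [lower, upper] so the mean has bounded sensitivity,
        # then add Laplace noise with scale sensitivity / epsilon.
        x = np.clip(x, lower, upper)
        sensitivity = (upper - lower) / len(x)
        return x.mean() + rng.laplace(scale=sensitivity / epsilon)

    rng = np.random.default_rng(0)
    income = rng.lognormal(mean=10.0, sigma=1.0, size=10_000)  # synthetic "tax" data
    print(dp_mean(income, 0.0, 500_000.0, epsilon=1.0, rng=rng))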
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:52-65 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2138760_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Wenzhuo Zhou Author-X-Name-First: Wenzhuo Author-X-Name-Last: Zhou Author-Name: Ruoqing Zhu Author-X-Name-First: Ruoqing Author-X-Name-Last: Zhu Author-Name: Annie Qu Author-X-Name-First: Annie Author-X-Name-Last: Qu Title: Estimating Optimal Infinite Horizon Dynamic Treatment Regimes via pT-Learning Abstract: Recent advances in mobile health (mHealth) technology provide an effective way to monitor individuals’ health statuses and deliver just-in-time personalized interventions. However, the practical use of mHealth technology raises unique challenges to existing methodologies on learning an optimal dynamic treatment regime. Many mHealth applications involve decision-making with large numbers of intervention options and under an infinite time horizon setting where the number of decision stages diverges to infinity. In addition, temporary medication shortages may cause optimal treatments to be unavailable, while it is unclear what alternatives can be used. To address these challenges, we propose a Proximal Temporal consistency Learning (pT-Learning) framework to estimate an optimal regime that is adaptively adjusted between deterministic and stochastic sparse policy models. The resulting minimax estimator avoids the double sampling issue in the existing algorithms. It can be further simplified and can easily incorporate off-policy data without mismatched distribution corrections. We study theoretical properties of the sparse policy and establish finite-sample bounds on the excess risk and performance error. The proposed method is provided in our proximalDTR package and is evaluated through extensive simulation studies and the OhioT1DM mHealth dataset. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 625-638 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2138760 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2138760 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:625-638 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2104728_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Jiaming Qiu Author-X-Name-First: Jiaming Author-X-Name-Last: Qiu Author-Name: Xiongtao Dai Author-X-Name-First: Xiongtao Author-X-Name-Last: Dai Author-Name: Zhengyuan Zhu Author-X-Name-First: Zhengyuan Author-X-Name-Last: Zhu Title: Nonparametric Estimation of Repeated Densities with Heterogeneous Sample Sizes Abstract: We consider the estimation of densities in multiple subpopulations, where the available sample size in each subpopulation greatly varies. This problem occurs in epidemiology, for example, where different diseases may share similar pathogenic mechanisms but differ in their prevalence. Without specifying a parametric form, our proposed method pools information from the population and estimates the density in each subpopulation in a data-driven fashion. Drawing from functional data analysis, low-dimensional approximating density families in the form of exponential families are constructed from the principal modes of variation in the log-densities.
Subpopulation densities are subsequently fitted in the approximating families based on likelihood principles and shrinkage. The approximating families increase in their flexibility as the number of components increases and can approximate arbitrary infinite-dimensional densities. We also derive convergence results of the density estimates formed with discrete observations. The proposed methods are shown to be interpretable and efficient in simulation studies as well as applications to electronic medical record and rainfall data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 176-188 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2104728 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2104728 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:176-188 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2128359_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Jianqing Fan Author-X-Name-First: Jianqing Author-X-Name-Last: Fan Author-Name: Yongyi Guo Author-X-Name-First: Yongyi Author-X-Name-Last: Guo Author-Name: Mengxin Yu Author-X-Name-First: Mengxin Author-X-Name-Last: Yu Title: Policy Optimization Using Semiparametric Models for Dynamic Pricing Abstract: In this article, we study the contextual dynamic pricing problem where the market value of a product is linear in its observed features plus some market noise. Products are sold one at a time, and only a binary response indicating success or failure of a sale is observed. Our model setting is similar to that of prior work, except that we expand the demand curve to a semiparametric model and learn dynamically both parametric and nonparametric components. We propose a dynamic statistical learning and decision making policy that minimizes regret (maximizes revenue) by combining semiparametric estimation for a generalized linear model with unknown link and online decision making. Under mild conditions, for a market noise cdf F(·) with mth order derivative (m ≥ 2), our policy achieves a regret upper bound of Õd(T^{(2m+1)/(4m−1)}), where T is the time horizon and Õd is the order hiding logarithmic terms and the feature dimension d. The upper bound is further reduced to Õd(√T) if F is super smooth. These upper bounds are close to Ω(√T), the lower bound where F belongs to a parametric class. We further generalize these results to the case with dynamic dependent product features under the strong mixing condition. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 552-564 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2128359 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2128359 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
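A toy simulation helps fix ideas for the pricing model in the Fan, Guo, and Yu abstract: the market value is linear in observed features plus market noise with cdf F, and only the binary sale indicator is observed. Taking F to be logistic (an assumption of this example only) makes the parameters recoverable up to scale by an offline logistic regression; the article's policy instead learns F nonparametrically while setting prices online:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    d, T = 3, 5_000
    theta = np.array([1.0, 0.5, -0.3])
    X = rng.normal(size=(T, d))                            # product features
    values = X @ theta + rng.logistic(scale=0.5, size=T)   # market value, F = logistic
    prices = rng.uniform(0.0, 2.0, size=T)                 # exploratory posted prices
    sales = (prices <= values).astype(int)                 # only this binary response is seen

    # P(sale) = sigmoid((x'theta - price) / 0.5), so logistic regression on
    # (X, price) recovers (theta, -1) scaled by 1 / 0.5.
    fit = LogisticRegression().fit(np.column_stack([X, prices]), sales)
    print(fit.coef_)  # approx. (2.0, 1.0, -0.6, -2.0)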
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:552-564 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2126780_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Matteo Barigozzi Author-X-Name-First: Matteo Author-X-Name-Last: Barigozzi Author-Name: Matteo Farnè Author-X-Name-First: Matteo Author-X-Name-Last: Farnè Title: An Algebraic Estimator for Large Spectral Density Matrices Abstract: We propose a new estimator of high-dimensional spectral density matrices, called the ALgebraic Spectral Estimator (ALSE), under the assumption of an underlying low rank plus sparse structure, as typically assumed in dynamic factor models. The ALSE is computed by minimizing a quadratic loss under a nuclear norm plus l1 norm constraint to control the latent rank and the residual sparsity pattern. The loss function requires as input the classical smoothed periodogram estimator and two threshold parameters, the choice of which is thoroughly discussed. We prove consistency of ALSE as both the dimension p and the sample size T diverge to infinity, as well as the recovery of latent rank and residual sparsity pattern with probability one. We then propose the UNshrunk ALgebraic Spectral Estimator (UNALSE), which is designed to minimize the Frobenius loss with respect to the pre-estimator while retaining the optimality of the ALSE. When applying UNALSE to a standard U.S. quarterly macroeconomic dataset, we find evidence of two main sources of comovements: a real factor driving the economy at business cycle frequencies, and a nominal factor driving the higher frequency dynamics. The article is also complemented by an extensive simulation exercise. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 498-510 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2126780 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2126780 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:498-510 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2118602_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Alessandro Mastrototaro Author-X-Name-First: Alessandro Author-X-Name-Last: Mastrototaro Author-Name: Jimmy Olsson Author-X-Name-First: Jimmy Author-X-Name-Last: Olsson Author-Name: Johan Alenlöv Author-X-Name-First: Johan Author-X-Name-Last: Alenlöv Title: Fast and Numerically Stable Particle-Based Online Additive Smoothing: The AdaSmooth Algorithm Abstract: We present a novel sequential Monte Carlo approach to online smoothing of additive functionals in a very general class of path-space models. Hitherto, the solutions proposed in the literature suffer from either long-term numerical instability due to particle-path degeneracy or, in the case that degeneracy is remedied by particle approximation of the so-called backward kernel, high computational demands. To optimally balance computational speed against numerical stability, we propose to furnish a (fast) naive particle smoother, propagating recursively a sample of particles and associated smoothing statistics, with an adaptive backward-sampling-based updating rule which allows the number of (costly) backward samples to be kept at a minimum.
This yields a new, function-specific additive smoothing algorithm, AdaSmooth, which is computationally fast, numerically stable and easy to implement. The algorithm is provided with rigorous theoretical results guaranteeing its consistency, asymptotic normality and long-term stability as well as numerical results demonstrating empirically the clear superiority of AdaSmooth to existing algorithms. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 356-367 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2118602 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2118602 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:356-367 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2141636_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Yunlu Jiang Author-X-Name-First: Yunlu Author-X-Name-Last: Jiang Author-Name: Xueqin Wang Author-X-Name-First: Xueqin Author-X-Name-Last: Wang Author-Name: Canhong Wen Author-X-Name-First: Canhong Author-X-Name-Last: Wen Author-Name: Yukang Jiang Author-X-Name-First: Yukang Author-X-Name-Last: Jiang Author-Name: Heping Zhang Author-X-Name-First: Heping Author-X-Name-Last: Zhang Title: Nonparametric Two-Sample Tests of High Dimensional Mean Vectors via Random Integration Abstract: Testing the equality of the means in two samples is a fundamental statistical inferential problem. Most of the existing methods are based on the sum-of-squares or supremum statistics. They are possibly powerful in some situations, but not in others, and they do not work in a unified way. Using random integration of the difference, we develop a framework that includes and extends many existing methods, especially in high-dimensional settings, without requiring equal covariance matrices or sparsity. Under a general multivariate model, we can derive the asymptotic properties of the proposed test statistic without specifying a relationship between the data dimension and sample size explicitly. Specifically, the new framework allows us to better understand the test’s properties and select a powerful procedure accordingly. For example, we prove that our proposed test can achieve a power of 1 when nonzero signals in the true mean differences are weakly dense with nearly the same sign. In addition, we delineate the conditions under which the asymptotic relative Pitman efficiency of our proposed test to its competitor is greater than or equal to 1. Extensive numerical studies and a real data example demonstrate the potential of our proposed test. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 701-714 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2141636 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2141636 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
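One plausible reading of the random-integration idea in the Jiang et al. abstract is sketched below: average squared projections of the sample mean difference over randomly drawn directions and calibrate by permutation. The projection law, the exact form of the statistic, and the permutation calibration are assumptions of this illustration, not the article's construction or its asymptotic theory:

    import numpy as np

    def random_integration_stat(X, Y, n_proj=1000, rng=None):
        # Average the squared projected mean difference over random
        # directions; an assumed Monte Carlo form of random integration.
        rng = rng if rng is not None else np.random.default_rng(0)
        diff = X.mean(axis=0) - Y.mean(axis=0)
        A = rng.normal(size=(n_proj, diff.size))
        return float(np.mean((A @ diff) ** 2))

    def permutation_pvalue(X, Y, n_perm=500, seed=0):
        # Calibrate the statistic by permuting the group labels.
        rng = np.random.default_rng(seed)
        obs = random_integration_stat(X, Y, rng=rng)
        Z, n = np.vstack([X, Y]), len(X)
        hits = 0
        for _ in range(n_perm):
            idx = rng.permutation(len(Z))
            hits += random_integration_stat(Z[idx[:n]], Z[idx[n:]], rng=rng) >= obs
        return (hits + 1) / (n_perm + 1)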
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:701-714 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2303300_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: The Editors Title: The Journal of the American Statistical Association 2023 Associate Editors Journal: Journal of the American Statistical Association Pages: 792-793 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2024.2303300 File-URL: http://hdl.handle.net/10.1080/01621459.2024.2303300 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:792-793 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2126781_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Xiufan Yu Author-X-Name-First: Xiufan Author-X-Name-Last: Yu Author-Name: Danning Li Author-X-Name-First: Danning Author-X-Name-Last: Li Author-Name: Lingzhou Xue Author-X-Name-First: Lingzhou Author-X-Name-Last: Xue Title: Fisher’s Combined Probability Test for High-Dimensional Covariance Matrices Abstract: Testing large covariance matrices is of fundamental importance in statistical analysis with high-dimensional data. In the past decade, three types of test statistics have been studied in the literature: quadratic form statistics, maximum form statistics, and their weighted combination. It is known that quadratic form statistics would suffer from low power against sparse alternatives and maximum form statistics would suffer from low power against dense alternatives. The weighted combination methods were introduced to enhance the power of quadratic form statistics or maximum form statistics when the weights are appropriately chosen. In this article, we provide a new perspective to exploit the full potential of quadratic form statistics and maximum form statistics for testing high-dimensional covariance matrices. We propose a scale-invariant power-enhanced test based on Fisher’s method to combine the p-values of quadratic form statistics and maximum form statistics. After carefully studying the asymptotic joint distribution of quadratic form statistics and maximum form statistics, we first prove that the proposed combination method retains the correct asymptotic size under the Gaussian assumption, and we also derive a new Lyapunov-type bound for the joint distribution and prove the correct asymptotic size of the proposed method without requiring the Gaussian assumption. Moreover, we show that the proposed method boosts the asymptotic power against more general alternatives. Finally, we demonstrate the finite-sample performance in simulation studies and a real application. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 511-524 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2126781 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2126781 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
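The combination step in the Yu, Li, and Xue abstract can be made concrete as follows. The two component statistics and their calibrations below are crude stand-ins, and the chi-squared(4) reference for the combined statistic assumes independent p-values, which is precisely the gap that the article's joint asymptotic analysis closes:

    import numpy as np
    from scipy import stats

    def fisher_combined_cov_test(X, Sigma0):
        # Stand-in components: a Frobenius-norm quadratic form (dense
        # alternatives) and a max entrywise deviation (sparse alternatives),
        # each mapped to a rough p-value, then combined by Fisher's method.
        n, p = X.shape
        D = np.cov(X, rowvar=False) - Sigma0
        p_quad = stats.chi2.sf(n * np.sum(D ** 2), df=p * (p + 1) // 2)
        p_max = min(1.0, 2 * p * p * stats.norm.sf(np.sqrt(n) * np.max(np.abs(D))))
        fisher = -2.0 * (np.log(max(p_quad, 1e-300)) + np.log(max(p_max, 1e-300)))
        return stats.chi2.sf(fisher, df=4)   # valid if the p-values were independent

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 5))
    print(fisher_combined_cov_test(X, np.eye(5)))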
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:511-524 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2141635_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Blair Bilodeau Author-X-Name-First: Blair Author-X-Name-Last: Bilodeau Author-Name: Alex Stringer Author-X-Name-First: Alex Author-X-Name-Last: Stringer Author-Name: Yanbo Tang Author-X-Name-First: Yanbo Author-X-Name-Last: Tang Title: Stochastic Convergence Rates and Applications of Adaptive Quadrature in Bayesian Inference Abstract: We provide the first stochastic convergence rates for a family of adaptive quadrature rules used to normalize the posterior distribution in Bayesian models. Our results apply to the uniform relative error in the approximate posterior density, the coverage probabilities of approximate credible sets, and approximate moments and quantiles, thereby guaranteeing fast asymptotic convergence of approximate summary statistics used in practice. The family of quadrature rules includes adaptive Gauss-Hermite quadrature, and we apply this rule in two challenging low-dimensional examples. Further, we demonstrate how adaptive quadrature can be used as a crucial component of a modern approximate Bayesian inference procedure for high-dimensional additive models. The method is implemented and made publicly available in the aghq package for the R language, available on CRAN. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 690-700 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2141635 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2141635 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:690-700 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2102503_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Mingzhang Yin Author-X-Name-First: Mingzhang Author-X-Name-Last: Yin Author-Name: Claudia Shi Author-X-Name-First: Claudia Author-X-Name-Last: Shi Author-Name: Yixin Wang Author-X-Name-First: Yixin Author-X-Name-Last: Wang Author-Name: David M. Blei Author-X-Name-First: David M. Author-X-Name-Last: Blei Title: Conformal Sensitivity Analysis for Individual Treatment Effects Abstract: Estimating an individual treatment effect (ITE) is essential to personalized decision making. However, existing methods for estimating the ITE often rely on unconfoundedness, an assumption that is fundamentally untestable with observed data. To assess the robustness of individual-level causal conclusions to the unconfoundedness assumption, this article proposes a method for sensitivity analysis of the ITE, a way to estimate a range of the ITE under unobserved confounding. The method we develop quantifies unmeasured confounding through a marginal sensitivity model, and adapts the framework of conformal inference to estimate an ITE interval at a given confounding strength. In particular, we formulate this sensitivity analysis as a conformal inference problem under distribution shift, and we extend existing methods of covariate-shifted conformal inference to this more general setting. The resulting predictive interval has guaranteed nominal coverage of the ITE and provides this coverage with distribution-free and nonasymptotic guarantees.
We evaluate the method on synthetic data and illustrate its application in an observational study. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 122-135 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2102503 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2102503 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:122-135 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2139265_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Yi Ding Author-X-Name-First: Yi Author-X-Name-Last: Ding Author-Name: Yingying Li Author-X-Name-First: Yingying Author-X-Name-Last: Li Author-Name: Rui Song Author-X-Name-First: Rui Author-X-Name-Last: Song Title: Statistical Learning for Individualized Asset Allocation Abstract: We establish a high-dimensional statistical learning framework for individualized asset allocation. Our proposed methodology addresses continuous-action decision-making with a large number of characteristics. We develop a discretization approach to model the effect of continuous actions and allow the discretization frequency to be large and diverge with the number of observations. We estimate the value function of the continuous action using penalized regression with our proposed generalized penalties, which are imposed on linear transformations of the model coefficients. We show that our proposed Discretization and Regression with generalized fOlded concaVe penalty on Effect discontinuity (DROVE) approach enjoys desirable theoretical properties and allows for statistical inference of the optimal value associated with optimal decision-making. Empirically, the proposed framework is exercised with the Health and Retirement Study data in finding individualized optimal asset allocation. The results show that our individualized optimal strategy improves the financial well-being of the population. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 639-649 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2139265 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2139265 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:639-649 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2140053_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Diego Morales-Navarrete Author-X-Name-First: Diego Author-X-Name-Last: Morales-Navarrete Author-Name: Moreno Bevilacqua Author-X-Name-First: Moreno Author-X-Name-Last: Bevilacqua Author-Name: Christian Caamaño-Carrillo Author-X-Name-First: Christian Author-X-Name-Last: Caamaño-Carrillo Author-Name: Luis M. Castro Author-X-Name-First: Luis M. Author-X-Name-Last: Castro Title: Modeling Point Referenced Spatial Count Data: A Poisson Process Approach Abstract: Random fields are useful mathematical tools for representing natural phenomena with complex dependence structures in space and/or time. In particular, the Gaussian random field is commonly used due to its attractive properties and mathematical tractability. However, this assumption seems to be restrictive when dealing with count data.
To deal with this situation, we propose a random field with a Poisson marginal distribution considering a sequence of independent copies of a random field with an exponential marginal distribution as “inter-arrival times” in the renewal counting process framework. Our proposal can be viewed as a spatial generalization of the Poisson counting process. Unlike the classical hierarchical Poisson Log-Gaussian model, our proposal generates a (non)-stationary random field that is mean square continuous and has Poisson marginal distributions. For the proposed Poisson spatial random field, analytic expressions for the covariance function and the bivariate distribution are provided. In an extensive simulation study, we investigate the weighted pairwise likelihood as a method for estimating the Poisson random field parameters. Finally, the effectiveness of our methodology is illustrated by an analysis of reindeer pellet-group survey data, where a zero-inflated version of the proposed model is compared with zero-inflated Poisson Log-Gaussian and Poisson Gaussian copula models. Supplementary materials for this article, including technical proofs and R code for reproducing the work, are available as an online supplement. Journal: Journal of the American Statistical Association Pages: 664-677 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2140053 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2140053 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:664-677 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2123336_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Ben Wu Author-X-Name-First: Ben Author-X-Name-Last: Wu Author-Name: Ying Guo Author-X-Name-First: Ying Author-X-Name-Last: Guo Author-Name: Jian Kang Author-X-Name-First: Jian Author-X-Name-Last: Kang Title: Bayesian Spatial Blind Source Separation via the Thresholded Gaussian Process Abstract: Blind source separation (BSS) aims to separate latent source signals from their mixtures. For spatially dependent signals in high-dimensional and large-scale data, such as neuroimaging, most existing BSS methods do not take into account the spatial dependence and the sparsity of the latent source signals. To address these major limitations, we propose a Bayesian spatial blind source separation (BSP-BSS) approach for neuroimaging data analysis. We model the expectation of the observed images as a linear mixture of multiple sparse and piece-wise smooth latent source signals, for which we construct a new class of Bayesian nonparametric prior models by thresholding Gaussian processes. We assign von Mises-Fisher (vMF) priors to the mixing coefficients in the model. Under some regularity conditions, we show that the proposed method has several desirable theoretical properties including the large support for the priors, the consistency of the joint posterior distribution of the latent source intensity functions and the mixing coefficients, and the selection consistency on the number of latent sources. We use extensive simulation studies and an analysis of the resting-state fMRI data in the Autism Brain Imaging Data Exchange (ABIDE) study to demonstrate that BSP-BSS outperforms existing methods in separating latent brain networks and detecting activated brain regions in the latent sources. Supplementary materials for this article are available online.
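The thresholded Gaussian process prior at the heart of BSP-BSS is easy to visualize: draw a smooth Gaussian process path and zero it out wherever its amplitude falls below a threshold, yielding the sparse, piecewise-smooth sources the abstract describes. The squared-exponential kernel, the soft-thresholding rule, and the one-dimensional grid below are illustrative choices of this example, not the article's settings:

    import numpy as np

    def thresholded_gp_sample(grid, length_scale=0.2, nu=0.5, rng=None):
        # Draw a Gaussian process path on the grid (squared-exponential
        # kernel) and soft-threshold it at nu, so the source is exactly
        # zero on low-amplitude regions yet smooth elsewhere.
        rng = rng if rng is not None else np.random.default_rng(3)
        d = np.abs(grid[:, None] - grid[None, :])
        K = np.exp(-0.5 * (d / length_scale) ** 2) + 1e-8 * np.eye(grid.size)
        g = rng.multivariate_normal(np.zeros(grid.size), K)
        return np.sign(g) * np.maximum(np.abs(g) - nu, 0.0)

    grid = np.linspace(0.0, 1.0, 200)
    source = thresholded_gp_sample(grid)
    print(f"{np.mean(source == 0.0):.0%} of locations are exactly zero")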
Journal: Journal of the American Statistical Association Pages: 422-433 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2123336 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2123336 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:422-433 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2120401_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Zhen Miao Author-X-Name-First: Zhen Author-X-Name-Last: Miao Author-Name: Weihao Kong Author-X-Name-First: Weihao Author-X-Name-Last: Kong Author-Name: Ramya Korlakai Vinayak Author-X-Name-First: Ramya Korlakai Author-X-Name-Last: Vinayak Author-Name: Wei Sun Author-X-Name-First: Wei Author-X-Name-Last: Sun Author-Name: Fang Han Author-X-Name-First: Fang Author-X-Name-Last: Han Title: Fisher-Pitman Permutation Tests Based on Nonparametric Poisson Mixtures with Application to Single Cell Genomics Abstract: This article investigates the theoretical and empirical performance of Fisher-Pitman-type permutation tests for assessing the equality of unknown Poisson mixture distributions. Building on nonparametric maximum likelihood estimators (NPMLEs) of the mixing distribution, these tests are theoretically shown to be able to adapt to complicated unspecified structures of count data and to be consistent against their corresponding ANOVA-type alternatives; the latter is a result in parallel to classic claims made by Robinson. The studied methods are then applied to single-cell RNA-seq data obtained from different cell types from brain samples of autism subjects and healthy controls; empirically, they unveil genes that are differentially expressed between autism and control subjects yet are missed using common tests. For justifying their use, rate optimality of NPMLEs is also established in settings similar to nonparametric Gaussian (Wu and Yang) and binomial mixtures (Tian, Kong, and Valiant; Vinayak et al.). Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 394-406 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2120401 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2120401 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:394-406 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2119983_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Jun Tao Author-X-Name-First: Jun Author-X-Name-Last: Tao Author-Name: Bing Li Author-X-Name-First: Bing Author-X-Name-Last: Li Author-Name: Lingzhou Xue Author-X-Name-First: Lingzhou Author-X-Name-Last: Xue Title: An Additive Graphical Model for Discrete Data Abstract: We introduce a nonparametric graphical model for discrete node variables based on additive conditional independence. Additive conditional independence is a three-way statistical relation that shares similar properties with conditional independence by satisfying the semi-graphoid axioms. Based on this relation, we build an additive graphical model for discrete variables that does not suffer from the restriction of a parametric model such as the Ising model.
We develop an estimator of the new graphical model via the penalized estimation of the discrete version of the additive precision operator and establish the consistency of the estimator in the ultrahigh-dimensional setting. Along with these methodological developments, we also exploit the properties of discrete random variables to uncover a deeper relation between additive conditional independence and conditional independence than previously known. The new graphical model reduces to a conditional independence graphical model under certain sparsity conditions. We conduct simulation experiments and analysis of an HIV antiretroviral therapy dataset to compare the new method with existing ones. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 368-381 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2119983 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2119983 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:368-381 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2142591_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Davide Viviano Author-X-Name-First: Davide Author-X-Name-Last: Viviano Author-Name: Jelena Bradic Author-X-Name-First: Jelena Author-X-Name-Last: Bradic Title: Fair Policy Targeting Abstract: One of the major concerns of targeting interventions on individuals in social welfare programs is discrimination: individualized treatments may induce disparities across sensitive attributes such as age, gender, or race. This article addresses the question of the design of fair and efficient treatment allocation rules. We adopt the nonmaleficence perspective of “first do no harm”: we select the fairest allocation within the Pareto frontier. We cast the optimization into a mixed-integer linear program formulation, which can be solved using off-the-shelf algorithms. We derive regret bounds on the unfairness of the estimated policy function and small sample guarantees on the Pareto frontier under general notions of fairness. Finally, we illustrate our method using an application from education economics. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 730-743 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2142591 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2142591 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:730-743 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2123813_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Zheng Tracy Ke Author-X-Name-First: Zheng Tracy Author-X-Name-Last: Ke Author-Name: Minzhe Wang Author-X-Name-First: Minzhe Author-X-Name-Last: Wang Title: Using SVD for Topic Modeling Abstract: The probabilistic topic model imposes a low-rank structure on the expectation of the corpus matrix. Therefore, singular value decomposition (SVD) is a natural tool for dimension reduction. We propose an SVD-based method for estimating a topic model.
Our method constructs an estimate of the topic matrix from only a few leading singular vectors of the data matrix, and has a great advantage in memory use and computational cost for large-scale corpora. The core ideas behind our method include a pre-SVD normalization to tackle severe word frequency heterogeneity, a post-SVD normalization to create a low-dimensional word embedding that manifests a simplex geometry, and a post-SVD procedure to construct an estimate of the topic matrix directly from the embedded word cloud. We provide the explicit rate of convergence of our method. We show that our method attains the optimal rate in the case of long and moderately long documents, and it improves the rates of existing methods in the case of short documents. The key to our analysis is a sharp row-wise large-deviation bound for empirical singular vectors, which is technically demanding to derive and potentially useful for other problems. We apply our method to a corpus of Associated Press news articles and a corpus of abstracts of statistical papers. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 434-449 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2123813 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2123813 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:434-449 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2110877_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Chun-Hao Yang Author-X-Name-First: Chun-Hao Author-X-Name-Last: Yang Author-Name: Hani Doss Author-X-Name-First: Hani Author-X-Name-Last: Doss Author-Name: Baba C. Vemuri Author-X-Name-First: Baba C. Author-X-Name-Last: Vemuri Title: An Empirical Bayes Approach to Shrinkage Estimation on the Manifold of Symmetric Positive-Definite Matrices Abstract: The James–Stein estimator is an estimator of the multivariate normal mean and dominates the maximum likelihood estimator (MLE) under squared error loss. The original work inspired great interest in developing shrinkage estimators for a variety of problems. Nonetheless, research on shrinkage estimation for manifold-valued data is scarce. In this article, we propose shrinkage estimators for the parameters of the Log-Normal distribution defined on the manifold of N × N symmetric positive-definite matrices. For this manifold, we choose the Log-Euclidean metric as its Riemannian metric since it is easy to compute and has been widely used in a variety of applications. By using the Log-Euclidean distance in the loss function, we derive a shrinkage estimator in an analytic form and show that it is asymptotically optimal within a large class of estimators that includes the MLE, which is the sample Fréchet mean of the data. We demonstrate the performance of the proposed shrinkage estimator via several simulated data experiments. Additionally, we apply the shrinkage estimator to perform statistical inference in both diffusion and functional magnetic resonance imaging problems. Supplementary materials for this article are available online.
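The Log-Euclidean geometry chosen by Yang, Doss, and Vemuri makes shrinkage on the SPD manifold nearly as simple as in Euclidean space: map each matrix to a symmetric matrix with the matrix logarithm, shrink toward the mean in that flat space, and map back. The fixed shrinkage weight in this sketch is a placeholder for the data-driven, asymptotically optimal amount derived in the article:

    import numpy as np
    from scipy.linalg import expm, logm

    def log_euclidean_shrinkage(mats, shrink=0.5):
        # Map each SPD matrix to a symmetric matrix via logm, shrink toward
        # the Log-Euclidean mean (the log of the sample Frechet mean) in
        # that flat space, and map back to the manifold via expm.
        logs = np.array([logm(S) for S in mats])
        center = logs.mean(axis=0)
        return [expm((1.0 - shrink) * L + shrink * center) for L in logs]

    rng = np.random.default_rng(4)
    spd = [a @ a.T + 3.0 * np.eye(3) for a in rng.normal(size=(10, 3, 3))]  # toy SPD sample
    print(log_euclidean_shrinkage(spd)[0])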
Journal: Journal of the American Statistical Association Pages: 259-272 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2110877 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2110877 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:259-272 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2287599_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Ting Ye Author-X-Name-First: Ting Author-X-Name-Last: Ye Title: Fundamentals of Causal Inference: With R Journal: Journal of the American Statistical Association Pages: 790-791 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2023.2287599 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2287599 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:790-791 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2104727_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Surya T. Tokdar Author-X-Name-First: Surya T. Author-X-Name-Last: Tokdar Author-Name: Sheng Jiang Author-X-Name-First: Sheng Author-X-Name-Last: Jiang Author-Name: Erika L. Cunningham Author-X-Name-First: Erika L. Author-X-Name-Last: Cunningham Title: Heavy-Tailed Density Estimation Abstract: A novel statistical method is proposed and investigated for estimating a heavy-tailed density under mild smoothness assumptions. Statistical analyses of heavy-tailed distributions are susceptible to the problem of sparse information in the tail of the distribution getting washed away by unrelated features of a hefty bulk. The proposed Bayesian method avoids this problem by incorporating smoothness and tail regularization through a carefully specified semiparametric prior distribution, and is able to consistently estimate both the density function and its tail index at near minimax optimal rates of contraction. A joint, likelihood-driven estimation of the bulk and the tail is shown to help improve uncertainty assessment in estimating the tail index parameter and offer more accurate and reliable estimates of the high tail quantiles compared to thresholding methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 163-175 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2104727 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2104727 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
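As a reference point for the thresholding methods that the Tokdar, Jiang, and Cunningham abstract compares against, the classical Hill estimator computes a tail index from the k largest observations alone, discarding the bulk that the article's Bayesian method fits jointly with the tail. A minimal version, with k chosen arbitrarily:

    import numpy as np

    def hill_tail_index(x, k):
        # Hill estimator from the k largest observations only: the classical
        # thresholding baseline that discards the bulk of the sample.
        x = np.sort(x)[::-1]
        return 1.0 / np.mean(np.log(x[:k] / x[k]))

    rng = np.random.default_rng(5)
    sample = rng.pareto(2.0, size=5_000) + 1.0  # Pareto sample with true tail index 2
    print(hill_tail_index(sample, k=200))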
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:163-175 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2115375_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Emre Demirkaya Author-X-Name-First: Emre Author-X-Name-Last: Demirkaya Author-Name: Yingying Fan Author-X-Name-First: Yingying Author-X-Name-Last: Fan Author-Name: Lan Gao Author-X-Name-First: Lan Author-X-Name-Last: Gao Author-Name: Jinchi Lv Author-X-Name-First: Jinchi Author-X-Name-Last: Lv Author-Name: Patrick Vossler Author-X-Name-First: Patrick Author-X-Name-Last: Vossler Author-Name: Jingbo Wang Author-X-Name-First: Jingbo Author-X-Name-Last: Wang Title: Optimal Nonparametric Inference with Two-Scale Distributional Nearest Neighbors Abstract: The weighted nearest neighbors (WNN) estimator has been popularly used as a flexible and easy-to-implement nonparametric tool for mean regression estimation. The bagging technique is an elegant way to form WNN estimators with weights automatically assigned to the nearest neighbors (Steele 2009; Biau, Cérou, and Guyader 2010); we call the resulting estimator the distributional nearest neighbors (DNN) estimator for easy reference. Yet, there is a lack of distributional results for this estimator, limiting its application to statistical inference. Moreover, when the mean regression function has higher-order smoothness, DNN does not achieve the optimal nonparametric convergence rate, mainly because of the bias issue. In this work, we provide an in-depth technical analysis of the DNN, based on which we suggest a bias reduction approach for the DNN estimator by linearly combining two DNN estimators with different subsampling scales, resulting in the novel two-scale DNN (TDNN) estimator. The two-scale DNN estimator has an equivalent representation of WNN with weights admitting explicit forms and some being negative. We prove that, thanks to the use of negative weights, the two-scale DNN estimator enjoys the optimal nonparametric rate of convergence in estimating the regression function under the fourth-order smoothness condition. We further go beyond estimation and establish that the DNN and two-scale DNN are both asymptotically normal as the subsampling scales and sample size diverge to infinity. For the practical implementation, we also provide variance estimators and a distribution estimator using the jackknife and bootstrap techniques for the two-scale DNN. These estimators can be exploited for constructing valid confidence intervals for nonparametric inference of the regression function. The theoretical results and appealing finite-sample performance of the suggested two-scale DNN method are illustrated with several simulation examples and a real data application. Journal: Journal of the American Statistical Association Pages: 297-307 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2115375 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2115375 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
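The two-scale construction in the Demirkaya et al. abstract follows directly from its definitions: a DNN estimate averages the 1-nearest-neighbor response over random subsamples of size s, and the TDNN estimate linearly combines two such estimates so that the leading s^(-2/d) bias terms cancel, which forces one weight to be negative. The Monte Carlo form and weight formula below are a sketch implied by that bias expansion, not the authors' implementation:

    import numpy as np

    def dnn(x0, X, y, s, n_rep=2000, rng=None):
        # DNN estimate at x0: average the 1-nearest-neighbor response
        # over random subsamples of size s.
        rng = rng if rng is not None else np.random.default_rng(6)
        total = 0.0
        for _ in range(n_rep):
            idx = rng.choice(len(y), size=s, replace=False)
            j = idx[np.argmin(np.linalg.norm(X[idx] - x0, axis=1))]
            total += y[j]
        return total / n_rep

    def tdnn(x0, X, y, s1, s2, rng=None):
        # Combine two DNN estimates so the leading C * s^(-2/d) bias terms
        # cancel: solving w1 + w2 = 1 and w1*s1^(-2/d) + w2*s2^(-2/d) = 0
        # makes one weight negative.
        d = X.shape[1]
        a, b = s1 ** (-2.0 / d), s2 ** (-2.0 / d)
        w1 = -b / (a - b)
        return w1 * dnn(x0, X, y, s1, rng=rng) + (1.0 - w1) * dnn(x0, X, y, s2, rng=rng)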
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:297-307 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2258595_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Daniel Mork Author-X-Name-First: Daniel Author-X-Name-Last: Mork Author-Name: Marianthi-Anna Kioumourtzoglou Author-X-Name-First: Marianthi-Anna Author-X-Name-Last: Kioumourtzoglou Author-Name: Marc Weisskopf Author-X-Name-First: Marc Author-X-Name-Last: Weisskopf Author-Name: Brent A. Coull Author-X-Name-First: Brent A. Author-X-Name-Last: Coull Author-Name: Ander Wilson Author-X-Name-First: Ander Author-X-Name-Last: Wilson Title: Heterogeneous Distributed Lag Models to Estimate Personalized Effects of Maternal Exposures to Air Pollution Abstract: Children’s health studies support an association between maternal environmental exposures and children’s birth outcomes. A common goal is to identify critical windows of susceptibility, that is, periods during gestation with increased association between maternal exposures and a future outcome. The timing of the critical windows and magnitude of the associations are likely heterogeneous across different levels of individual, family, and neighborhood characteristics. Using an administrative Colorado birth cohort, we estimate the individualized relationship between weekly exposures to fine particulate matter (PM2.5) during gestation and birth weight. To achieve this goal, we propose a statistical learning method combining distributed lag models and Bayesian additive regression trees to estimate critical windows at the individual level and identify characteristics that induce heterogeneity from a high-dimensional set of potential modifying factors. We find evidence of heterogeneity in the PM2.5–birth weight relationship, with some mother–child dyads showing a three times larger decrease in birth weight for an IQR increase in exposure (5.9–8.5 μg/m3 PM2.5) compared to the population average. Specifically, we find increased vulnerability for non-Hispanic mothers who are younger, have a higher body mass index, or have lower educational attainment. Our case study is the first precision health study of critical windows. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 14-26 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2023.2258595 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2258595 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:14-26 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2133719_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Tucker McElroy Author-X-Name-First: Tucker Author-X-Name-Last: McElroy Author-Name: Dimitris N. Politis Author-X-Name-First: Dimitris N. Author-X-Name-Last: Politis Title: Estimating the Spectral Density at Frequencies Near Zero Abstract: Estimating the spectral density function f(w) for some w∈[−π,π] has been traditionally performed by kernel smoothing the periodogram and related techniques. Kernel smoothing is tantamount to local averaging, that is, approximating f(w) by a constant over a window of small width.
Although f(w) is uniformly continuous and periodic with period 2π, in this article we recognize the fact that w = 0 effectively acts as a boundary point in the underlying kernel smoothing problem, and the same is true for w=±π. It is well known that local averaging may be suboptimal in kernel regression at (or near) a boundary point. As an alternative, we propose a local polynomial regression of the periodogram or log-periodogram when w is at (or near) the points 0 or ±π. The case w = 0 is of particular importance since f(0) is the large-sample variance of the sample mean; hence, estimating f(0) is crucial in order to conduct any sort of inference on the mean. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 612-624 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2133719 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2133719 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:612-624 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2106234_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Ganggang Xu Author-X-Name-First: Ganggang Author-X-Name-Last: Xu Author-Name: Jingfei Zhang Author-X-Name-First: Jingfei Author-X-Name-Last: Zhang Author-Name: Yehua Li Author-X-Name-First: Yehua Author-X-Name-Last: Li Author-Name: Yongtao Guan Author-X-Name-First: Yongtao Author-X-Name-Last: Guan Title: Bias-Correction and Test for Mark-Point Dependence with Replicated Marked Point Processes Abstract: Mark-point dependence plays a critical role in research problems that can be fitted into the general framework of marked point processes. In this work, we focus on adjusting for mark-point dependence when estimating the mean and covariance functions of the mark process, given independent replicates of the marked point process. We assume that the mark process is a Gaussian process and the point process is a log-Gaussian Cox process, where the mark-point dependence is generated through the dependence between two latent Gaussian processes. Under this framework, naive local linear estimators ignoring the mark-point dependence can be severely biased. We show that this bias can be corrected using a local linear estimator of the cross-covariance function and establish uniform convergence rates of the bias-corrected estimators. Furthermore, we propose a test statistic based on local linear estimators for mark-point independence, which is shown to converge to an asymptotic normal distribution at a parametric √n convergence rate. Model diagnostic tools are developed for key model assumptions, and a robust functional permutation test is proposed for a more general class of mark-point processes. The effectiveness of the proposed methods is demonstrated using extensive simulations and applications to two real data examples. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 217-231 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2106234 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2106234 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
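Returning to the McElroy and Politis abstract above: because w = 0 acts as a boundary point, locally averaging the periodogram there is biased, whereas a local linear fit is not, to first order. The sketch below fits a line to the raw periodogram at the lowest Fourier frequencies and reads the f(0) estimate off the intercept; the bandwidth and the use of the raw rather than the log-periodogram are simplifying choices of this example:

    import numpy as np

    def f0_local_linear(x, bandwidth=0.3):
        # Local linear regression of the raw periodogram on frequency near
        # the boundary point w = 0; the intercept estimates f(0), avoiding
        # the boundary bias of a plain local average.
        n = len(x)
        w = 2.0 * np.pi * np.arange(1, n // 2) / n
        I = np.abs(np.fft.fft(x - x.mean())[1:n // 2]) ** 2 / (2.0 * np.pi * n)
        keep = w <= bandwidth
        coef, *_ = np.linalg.lstsq(np.vander(w[keep], 2), I[keep], rcond=None)
        return coef[1]  # intercept, i.e., the fitted value at w = 0

    rng = np.random.default_rng(7)
    x = rng.normal(size=4096)  # white noise: f(w) = var / (2*pi) for all w
    print(f0_local_linear(x), 1.0 / (2.0 * np.pi))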
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:217-231 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2126362_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Stijn Vansteelandt Author-X-Name-First: Stijn Author-X-Name-Last: Vansteelandt Author-Name: Oliver Dukes Author-X-Name-First: Oliver Author-X-Name-Last: Dukes Author-Name: Kelly Van Lancker Author-X-Name-First: Kelly Author-X-Name-Last: Van Lancker Author-Name: Torben Martinussen Author-X-Name-First: Torben Author-X-Name-Last: Martinussen Title: Assumption-Lean Cox Regression Abstract: Inference for the conditional association between an exposure and a time-to-event endpoint, given covariates, is routinely based on partial likelihood estimators for hazard ratios indexing Cox proportional hazards models. This approach is flexible and makes testing straightforward, but is nonetheless not entirely satisfactory. First, there is no good understanding of what it infers when the model is misspecified. Second, it is common to employ variable selection procedures when deciding which model to use. However, the bias and uncertainty that imperfect variable selection adds to the analysis are rarely acknowledged, rendering standard inferences biased and overly optimistic. To remedy this, we propose a nonparametric estimand which reduces to the main exposure effect parameter in a (partially linear) Cox model when that model is correct, but continues to capture the (conditional) association of interest in a well understood way, even when this model is misspecified in an arbitrary manner. We achieve assumption-lean inference for this estimand based on its influence function under the nonparametric model. This has the further advantage that it makes the proposed approach amenable to the use of data-adaptive procedures (e.g., variable selection, machine learning), which we find to work well in simulation studies and a data analysis. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 475-484 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2126362 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2126362 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:475-484 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2118601_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Jing Zeng Author-X-Name-First: Jing Author-X-Name-Last: Zeng Author-Name: Qing Mai Author-X-Name-First: Qing Author-X-Name-Last: Mai Author-Name: Xin Zhang Author-X-Name-First: Xin Author-X-Name-Last: Zhang Title: Subspace Estimation with Automatic Dimension and Variable Selection in Sufficient Dimension Reduction Abstract: Sufficient dimension reduction (SDR) methods aim to find lower-dimensional representations of a multivariate predictor that preserve all the information about the conditional distribution of the response given the predictor. The reduction is commonly achieved by projecting the predictor onto a low-dimensional subspace. The smallest such subspace is known as the Central Subspace (CS) and is the key parameter of interest for most SDR methods. In this article, we propose a unified and flexible framework for estimating the CS in high dimensions.
Our approach generalizes a wide range of model-based and model-free SDR methods to high-dimensional settings, where the CS is assumed to involve only a subset of the predictors. We formulate the problem as a convex quadratic optimization problem, so that the global solution is attainable. The proposed estimation procedure simultaneously achieves the structural dimension selection and coordinate-independent variable selection of the CS. Theoretically, our method achieves dimension selection, variable selection, and subspace estimation consistency at a fast convergence rate under mild conditions. We demonstrate the effectiveness and efficiency of our method with extensive simulation studies and real data examples. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 343-355 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2118601 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2118601 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:343-355 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2293811_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Raymond K. W. Wong Author-X-Name-First: Raymond K. W. Author-X-Name-Last: Wong Title: Handbook of Matching and Weighting Adjustments for Causal Inference Journal: Journal of the American Statistical Association Pages: 791-791 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2023.2293811 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2293811 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:791-791 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2102502_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Monica Billio Author-X-Name-First: Monica Author-X-Name-Last: Billio Author-Name: Roberto Casarin Author-X-Name-First: Roberto Author-X-Name-Last: Casarin Author-Name: Matteo Iacopini Author-X-Name-First: Matteo Author-X-Name-Last: Iacopini Title: Bayesian Markov-Switching Tensor Regression for Time-Varying Networks Abstract: Modeling time series of multilayer network data is challenging due to the peculiar characteristics of real-world networks, such as sparsity and abrupt structural changes. Moreover, the impact of external factors on the network edges is highly heterogeneous due to edge- and time-specific effects. Capturing all these features results in a very high-dimensional inference problem. A novel tensor-on-tensor regression model is proposed, which integrates zero-inflated logistic regression to deal with the sparsity, and Markov-switching coefficients to account for structural changes. A tensor representation and decomposition of the regression coefficients are used to tackle the high dimensionality and account for the heterogeneous impact of the covariate tensor across the response variables. The inference is performed following a Bayesian approach, and an efficient Gibbs sampler is developed for posterior approximation. Our methodology applied to financial and email networks detects different connectivity regimes and uncovers the role of covariates in the edge-formation process, which are relevant in risk and resource management. Code is available on GitHub.
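As background to the central subspace targeted by Zeng, Mai, and Zhang above, here is a compact sketch of classical sliced inverse regression (SIR), one of the model-free SDR methods their framework generalizes. It is a low-dimensional textbook version, not the proposed penalized estimator, and the single-index model is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy single-index model: y depends on X only through b'X, so the
# central subspace is span{b}.
n, p = 2000, 6
X = rng.normal(size=(n, p))
b = np.array([1.0, -1.0, 0.0, 0.0, 0.0, 0.0]) / np.sqrt(2)
u = X @ b
y = u + 0.5 * u**3 + 0.2 * rng.normal(size=n)

def sir(X, y, n_slices=10, n_dirs=1):
    """Sliced inverse regression: eigenvectors of the between-slice
    covariance of standardized predictors span the CS estimate."""
    n, p = X.shape
    S = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(S)
    S_inv_half = evecs @ np.diag(evals**-0.5) @ evecs.T
    Z = (X - X.mean(0)) @ S_inv_half
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(0)
        M += (len(idx) / n) * np.outer(m, m)
    _, vecs = np.linalg.eigh(M)
    dirs = S_inv_half @ vecs[:, -n_dirs:]      # back to the X scale
    return dirs / np.linalg.norm(dirs, axis=0)

b_hat = sir(X, y).ravel()
print("|cos angle(b_hat, b)|:", abs(b_hat @ b))
```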
Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 109-121 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2102502 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2102502 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:109-121 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2110878_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Chengchun Shi Author-X-Name-First: Chengchun Author-X-Name-Last: Shi Author-Name: Jin Zhu Author-X-Name-First: Jin Author-X-Name-Last: Zhu Author-Name: Shen Ye Author-X-Name-First: Shen Author-X-Name-Last: Ye Author-Name: Shikai Luo Author-X-Name-First: Shikai Author-X-Name-Last: Luo Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Author-Name: Rui Song Author-X-Name-First: Rui Author-X-Name-Last: Song Title: Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process Abstract: This article is concerned with constructing a confidence interval for a target policy’s value offline based on pre-collected observational data in infinite horizon settings. Most existing works assume that no unmeasured variables exist that confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and technological industries. In this article, we show that with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy’s value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient off-policy value estimator that is robust to potential model misspecification and provide rigorous uncertainty quantification. Our method is justified by theoretical results and by simulated and real datasets obtained from ridesharing companies. A Python implementation of the proposed procedure is available at https://github.com/Mamba413/cope. Journal: Journal of the American Statistical Association Pages: 273-284 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2110878 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2110878 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:273-284 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2261184_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Xingche Guo Author-X-Name-First: Xingche Author-X-Name-Last: Guo Author-Name: Donglin Zeng Author-X-Name-First: Donglin Author-X-Name-Last: Zeng Author-Name: Yuanjia Wang Author-X-Name-First: Yuanjia Author-X-Name-Last: Wang Title: A Semiparametric Inverse Reinforcement Learning Approach to Characterize Decision Making for Mental Disorders Abstract: Major depressive disorder (MDD) is one of the leading causes of disability-adjusted life years. Emerging evidence indicates the presence of reward processing abnormalities in MDD. An important scientific question is whether the abnormalities are due to reduced sensitivity to received rewards or reduced learning ability.
Motivated by the probabilistic reward task (PRT) experiment in the EMBARC study, we propose a semiparametric inverse reinforcement learning (RL) approach to characterize the reward-based decision-making of MDD patients. The model assumes that a subject’s decision-making process is updated based on a reward prediction error weighted by the subject-specific learning rate. To account for the fact that a subject favors decisions leading to potentially high rewards, while allowing this dependence to be nonlinear, we model reward sensitivity with a nondecreasing nonlinear function. For inference, we estimate the latter via approximation by I-splines and then maximize the joint conditional log-likelihood. We show that the resulting estimators are consistent and asymptotically normal. Through extensive simulation studies, we demonstrate that under different reward-generating distributions, the semiparametric inverse RL outperforms the parametric inverse RL. We apply the proposed method to EMBARC and find that MDD and control groups have similar learning rates but different reward sensitivity functions. There is strong statistical evidence that reward sensitivity functions have nonlinear forms. Using additional brain imaging data in the same study, we find that both reward sensitivity and learning rate are associated with brain activities in the negative affect circuitry under an emotional conflict task. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 27-38 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2023.2261184 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2261184 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:27-38 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2286293_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Giacomo Bormetti Author-X-Name-First: Giacomo Author-X-Name-Last: Bormetti Title: Stable Lévy Processes via Lamperti-Type Representations Journal: Journal of the American Statistical Association Pages: 789-790 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2023.2286293 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2286293 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:789-790 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2126782_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Jason M. Klusowski Author-X-Name-First: Jason M. Author-X-Name-Last: Klusowski Author-Name: Peter M. Tian Author-X-Name-First: Peter M. Author-X-Name-Last: Tian Title: Large Scale Prediction with Decision Trees Abstract: This article shows that decision trees constructed with Classification and Regression Trees (CART) and C4.5 methodology are consistent for regression and classification tasks, even when the number of predictor variables grows sub-exponentially with the sample size, under natural ℓ0-norm and ℓ1-norm sparsity constraints. The theory applies to a wide range of models, including (ordinary or logistic) additive regression models with component functions that are continuous, of bounded variation, or, more generally, Borel measurable.
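The decision-making model in the Guo, Zeng, and Wang abstract above (a prediction-error update with a subject-specific learning rate and a nondecreasing reward-sensitivity function) can be sketched in a few lines. The concave sensitivity function, the two-arm task, and all parameter values below are illustrative assumptions; the paper estimates the sensitivity function via I-splines rather than fixing it.

```python
import numpy as np

rng = np.random.default_rng(2)

def phi(r):
    # Hypothetical nondecreasing, nonlinear reward sensitivity,
    # standing in for the I-spline expansion estimated in the paper.
    return np.log1p(r)

def simulate_subject(n_trials=500, beta=0.2, temp=3.0):
    """PRT-style task: two arms with unequal reinforcement rates.
    Value update: q[a] <- q[a] + beta * (phi(r) - q[a]),
    with softmax choice between the two arms."""
    q = np.zeros(2)
    reinforce_prob = np.array([0.3, 0.6])       # 'lean' vs 'rich' arm
    choices = np.empty(n_trials, dtype=int)
    for t in range(n_trials):
        p1 = 1.0 / (1.0 + np.exp(-temp * (q[1] - q[0])))
        a = int(rng.random() < p1)
        r = rng.exponential(1.0) if rng.random() < reinforce_prob[a] else 0.0
        q[a] += beta * (phi(r) - q[a])
        choices[t] = a
    return choices

choices = simulate_subject()
print("fraction choosing the richer arm:", choices.mean())
```

Inference in the paper runs this generative logic in reverse, maximizing the joint conditional log-likelihood of the observed choices over the learning rate and the spline coefficients.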
Consistency holds for arbitrary joint distributions of the predictor variables, thereby accommodating continuous, discrete, and/or dependent data. Finally, we show that these qualitative properties of individual trees are inherited by Breiman’s random forests. A key step in the analysis is the establishment of an oracle inequality, which allows for a precise characterization of the goodness of fit and complexity tradeoff for a misspecified model. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 525-537 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2126782 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2126782 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:525-537 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2102985_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Hanzhong Liu Author-X-Name-First: Hanzhong Author-X-Name-Last: Liu Author-Name: Jiyang Ren Author-X-Name-First: Jiyang Author-X-Name-Last: Ren Author-Name: Yuehan Yang Author-X-Name-First: Yuehan Author-X-Name-Last: Yang Title: Randomization-based Joint Central Limit Theorem and Efficient Covariate Adjustment in Randomized Block 2^K Factorial Experiments Abstract: Randomized block factorial experiments are widely used in industrial engineering, clinical trials, and social science. Researchers often use a linear model and analysis of covariance to analyze experimental results; however, limited studies have addressed the validity and robustness of the resulting inferences because assumptions for a linear model might not be justified by randomization in randomized block factorial experiments. In this article, we establish a new finite population joint central limit theorem for usual (unadjusted) factorial effect estimators in randomized block 2^K factorial experiments. Our theorem is obtained under a randomization-based inference framework, making use of an extension of the vector form of the Wald–Wolfowitz–Hoeffding theorem for a linear rank statistic. It is robust to model misspecification, numbers of blocks, block sizes, and propensity scores across blocks. To improve the estimation and inference efficiency, we propose four covariate adjustment methods. We show that under mild conditions, the resulting covariate-adjusted factorial effect estimators are consistent, jointly asymptotically normal, and generally more efficient than the unadjusted estimator. In addition, we propose Neyman-type conservative estimators for the asymptotic covariances to facilitate valid inferences. Simulation studies and a clinical trial data analysis demonstrate the benefits of the covariate adjustment methods. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 136-150 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2102985 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2102985 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
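The unadjusted factorial effect estimator studied by Liu, Ren, and Yang above has a simple form: contrast arm means within each block, scale by 2^{-(K-1)}, and weight blocks by their sizes. Below is a numpy sketch for K = 2 on hypothetical toy data; the covariate-adjusted versions proposed in the paper are not shown.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)

K = 2
arms = np.array(list(product([-1, 1], repeat=K)))       # the 2^K arms

def contrasts(arms):
    a, b = arms[:, 0], arms[:, 1]
    return np.column_stack([a, b, a * b])               # mains + interaction

def factorial_effects(y, arm_idx, block):
    """Unadjusted estimator: within-block arm means, contrasted and
    scaled by 2^{-(K-1)}, then combined with block-size weights."""
    C = contrasts(arms)
    est = np.zeros(C.shape[1])
    for blk in np.unique(block):
        sel = block == blk
        arm_means = np.array([y[sel & (arm_idx == j)].mean()
                              for j in range(len(arms))])
        est += sel.mean() * (C.T @ arm_means) / 2 ** (K - 1)
    return est

# Balanced toy design: 3 blocks, 20 units per arm per block.
block = np.repeat([0, 1, 2], 80)
arm_idx = np.tile(np.repeat(np.arange(4), 20), 3)
tau = np.array([1.0, 0.5, 0.0])                         # true effects
y = contrasts(arms)[arm_idx] @ tau / 2 + 0.3 * block + rng.normal(size=240)
print("estimated effects:", np.round(factorial_effects(y, arm_idx, block), 2))
```

Block main effects cancel inside each contrast, which is why the estimator is unbiased to block-level shifts.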
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:136-150 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2128807_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Matteo Barigozzi Author-X-Name-First: Matteo Author-X-Name-Last: Barigozzi Author-Name: Giuseppe Cavaliere Author-X-Name-First: Giuseppe Author-X-Name-Last: Cavaliere Author-Name: Lorenzo Trapani Author-X-Name-First: Lorenzo Author-X-Name-Last: Trapani Title: Inference in Heavy-Tailed Nonstationary Multivariate Time Series Abstract: We study inference on the common stochastic trends in a nonstationary, N-variate time series y_t, in the possible presence of heavy tails. We propose a novel methodology which does not require any knowledge or estimation of the tail index, or even knowledge as to whether certain moments (such as the variance) exist or not, and develop an estimator of the number of stochastic trends m based on the eigenvalues of the sample second moment matrix of y_t. We study the rates of such eigenvalues, showing that the first m ones diverge, as the sample size T passes to infinity, at a rate faster by O(T) than the remaining N – m ones, irrespective of the tail index. We thus exploit this eigen-gap by constructing, for each eigenvalue, a test statistic which diverges to positive infinity or drifts to zero according to whether the relevant eigenvalue belongs to the set of the first m eigenvalues or not. We then construct a randomized statistic based on this, using it as part of a sequential testing procedure, ensuring consistency of the resulting estimator of m. We also discuss an estimator of the common trends based on principal components and show that, up to an invertible linear transformation, this estimator is consistent in the sense that the estimation error is of smaller order than the trend itself. Importantly, we present the case in which we relax the standard assumption of iid innovations, by allowing for heterogeneity of a very general form in the scale of the innovations. Finally, we develop an extension to the large dimensional case. A Monte Carlo study shows that the proposed estimator for m performs particularly well, even in samples of small size. We complete the article by presenting two illustrative applications covering commodity prices and interest rates data. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 565-581 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2128807 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2128807 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:565-581 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2276742_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Insuk Seo Author-X-Name-First: Insuk Author-X-Name-Last: Seo Title: Martingale Methods in Statistics Journal: Journal of the American Statistical Association Pages: 787-789 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2023.2276742 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2276742 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
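The eigen-gap that drives the Barigozzi, Cavaliere, and Trapani estimator of the number of common trends m is easy to visualize: with m random-walk trends loading on an N-variate series, the leading m eigenvalues of the sample second moment matrix grow at a faster rate in T than the rest. The sketch below only illustrates the gap (with moderately heavy t(3) innovations chosen for illustration); it does not implement the randomized sequential tests of the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

N, m, T = 6, 2, 2000
loadings = rng.normal(size=(N, m))
# m common random-walk trends with heavy-ish tailed increments.
trends = np.cumsum(rng.standard_t(df=3, size=(T, m)), axis=0)
y = trends @ loadings.T + rng.standard_t(df=3, size=(T, N))

S = (y.T @ y) / T                      # sample second moment matrix
eigs = np.sort(np.linalg.eigvalsh(S))[::-1]
print("eigenvalues:", np.round(eigs, 1))
print("ratios eig[k]/eig[k+1]:", np.round(eigs[:-1] / eigs[1:], 1))
```

The ratio spikes at k = m, which is the feature the sequential testing procedure formalizes without estimating the tail index.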
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:787-789 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2102986_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Wei Ma Author-X-Name-First: Wei Author-X-Name-Last: Ma Author-Name: Ping Li Author-X-Name-First: Ping Author-X-Name-Last: Li Author-Name: Li-Xin Zhang Author-X-Name-First: Li-Xin Author-X-Name-Last: Zhang Author-Name: Feifang Hu Author-X-Name-First: Feifang Author-X-Name-Last: Hu Title: A New and Unified Family of Covariate Adaptive Randomization Procedures and Their Properties Abstract: In clinical trials and other comparative studies, covariate balance is crucial for credible and efficient assessment of treatment effects. Covariate adaptive randomization (CAR) procedures are extensively used to reduce the likelihood of covariate imbalance. In the literature, most studies have focused on balancing of discrete covariates. Applications of CAR with continuous covariates remain rare, especially when the interest goes beyond balancing only the first moment. In this article, we propose a family of CAR procedures that can balance general covariate features, such as quadratic and interaction terms. Our framework not only unifies many existing methods, but also introduces a much broader class of new and useful CAR procedures. We show that the proposed procedures have superior balancing properties; in particular, the convergence rate of imbalance vectors is O_P(n^ϵ) for any ϵ > 0 if all moments of the covariate features are finite, relative to O_P(n^{1/2}) under complete randomization, where n is the sample size. Both the resulting convergence rate and its proof are novel. These favorable balancing properties lead to increased precision of treatment effect estimation in the presence of nonlinear covariate effects. The framework is applied to balance covariate means and covariance matrices simultaneously. Simulation and empirical studies demonstrate the excellent and robust performance of the proposed procedures. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 151-162 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2102986 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2102986 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:151-162 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2140054_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Jolien Ponnet Author-X-Name-First: Jolien Author-X-Name-Last: Ponnet Author-Name: Pieter Segaert Author-X-Name-First: Pieter Author-X-Name-Last: Segaert Author-Name: Stefan Van Aelst Author-X-Name-First: Stefan Author-X-Name-Last: Van Aelst Author-Name: Tim Verdonck Author-X-Name-First: Tim Author-X-Name-Last: Verdonck Title: Robust Inference and Modeling of Mean and Dispersion for Generalized Linear Models Abstract: Generalized Linear Models (GLMs) are a popular class of regression models when the responses follow a distribution in the exponential family. In real data, the variability often deviates from the relation imposed by the exponential family distribution, which results in over- or underdispersion. Dispersion effects may even vary in the data. Such datasets do not follow the traditional GLM distributional assumptions, leading to unreliable inference.
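A minimal version of the covariate adaptive randomization idea in the Ma, Li, Zhang, and Hu abstract above: sequentially assign each arrival, with a biased coin, to whichever arm keeps a signed sum of covariate features closest to zero, then compare the final imbalance to complete randomization. The feature map (means, second moments, interactions) and the bias probability are illustrative choices; this is a generic sketch of the mechanism, not the specific family of procedures in the paper.

```python
import numpy as np

rng = np.random.default_rng(5)

def features(x):
    # Balance first moments plus quadratic and interaction terms.
    quad = np.outer(x, x)[np.triu_indices(len(x))]
    return np.concatenate([x, quad])

def car_assign(X, p_bias=0.85):
    """Biased-coin CAR: prefer the arm that shrinks the norm of the
    running signed feature imbalance."""
    imb = np.zeros(features(X[0]).shape)
    for x in X:
        f = features(x)
        d_plus, d_minus = np.linalg.norm(imb + f), np.linalg.norm(imb - f)
        prefer = 1 if d_plus < d_minus else 0
        if d_plus == d_minus:
            prefer = int(rng.random() < 0.5)
        a = prefer if rng.random() < p_bias else 1 - prefer
        imb += f if a == 1 else -f
    return imb

X = rng.normal(size=(500, 3))
F = np.apply_along_axis(features, 1, X)
imb_cr = (rng.choice([-1, 1], size=500)[:, None] * F).sum(0)
print("imbalance norm, CAR:", round(np.linalg.norm(car_assign(X)), 2))
print("imbalance norm, CR: ", round(np.linalg.norm(imb_cr), 2))
```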
Therefore, the family of double exponential distributions has been proposed, which models both the mean and the dispersion as a function of covariates in the GLM framework. Since standard maximum likelihood inference is highly susceptible to the possible presence of outliers, we propose the robust double exponential (RDE) estimator. Asymptotic properties and robustness of the RDE estimator are discussed. A generalized robust quasi-deviance measure is introduced which constitutes the basis for a stable robust test. Simulations for binomial and Poisson models show the excellent performance of the RDE estimator and corresponding robust tests. Penalized versions of the RDE estimator are developed for sparse estimation with high-dimensional data and for flexible estimation via generalized additive models (GAMs). Real data applications illustrate the relevance of robust inference for dispersion effects in GLMs and GAMs. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 678-689 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2140054 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2140054 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:678-689 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2144737_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Xiao Wu Author-X-Name-First: Xiao Author-X-Name-Last: Wu Author-Name: Fabrizia Mealli Author-X-Name-First: Fabrizia Author-X-Name-Last: Mealli Author-Name: Marianthi-Anna Kioumourtzoglou Author-X-Name-First: Marianthi-Anna Author-X-Name-Last: Kioumourtzoglou Author-Name: Francesca Dominici Author-X-Name-First: Francesca Author-X-Name-Last: Dominici Author-Name: Danielle Braun Author-X-Name-First: Danielle Author-X-Name-Last: Braun Title: Matching on Generalized Propensity Scores with Continuous Exposures Abstract: In the context of a binary treatment, matching is a well-established approach in causal inference. However, in the context of a continuous treatment or exposure, matching is still underdeveloped. We propose an innovative matching approach to estimate an average causal exposure-response function under the setting of continuous exposures that relies on the generalized propensity score (GPS). Our approach maintains the following attractive features of matching: (a) clear separation between the design and the analysis; (b) robustness to model misspecification or to the presence of extreme values of the estimated GPS; (c) straightforward assessments of covariate balance. We first introduce an assumption of identifiability, called local weak unconfoundedness. Under this assumption and mild smoothness conditions, we provide theoretical guarantees that our proposed matching estimator attains point-wise consistency and asymptotic normality. In simulations, our proposed matching approach outperforms existing methods under settings with model misspecification or in the presence of extreme values of the estimated GPS. We apply our proposed method to estimate the average causal exposure-response function between long-term PM2.5 exposure and all-cause mortality among 68.5 million Medicare enrollees, 2000–2016. We found strong evidence of a harmful effect of long-term PM2.5 exposure on mortality.
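A caricature of the GPS matching idea from the Wu et al. abstract above, with every modeling choice (linear Gaussian exposure model, Euclidean matching metric, grid of exposure values) a simplifying assumption: for each grid value w* of the exposure, every unit is matched to the observed unit closest in standardized (exposure, GPS) space, and the matched outcomes are averaged to trace out the exposure-response function. The CausalGPS package implements a far more careful and scalable version.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Toy data: confounder x drives both exposure w and outcome y.
n = 1000
x = rng.normal(size=n)
w = 0.8 * x + rng.normal(size=n)              # continuous exposure
y = 2.0 * w + 1.5 * x + rng.normal(size=n)    # true curve: E[y | do(w)] = 2w

# Design stage: fit the exposure model and define the GPS.
b1, b0 = np.polyfit(x, w, 1)
s = (w - (b0 + b1 * x)).std()

def gps(w_val, x):
    return stats.norm.pdf(w_val, loc=b0 + b1 * x, scale=s)

g_obs = gps(w, x)                              # GPS at observed exposures

def erf_at(w_star):
    """Impute Y(w*) for each unit via its nearest neighbor in the
    standardized (exposure, GPS) space, then average."""
    g_star = gps(w_star, x)
    y_imp = np.empty(n)
    for i in range(n):
        d = ((w - w_star) / w.std()) ** 2 + ((g_obs - g_star[i]) / g_obs.std()) ** 2
        y_imp[i] = y[np.argmin(d)]
    return y_imp.mean()

grid = np.linspace(-2, 2, 5)
print([round(erf_at(v), 2) for v in grid])     # should roughly track 2 * grid
```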
Code for the proposed matching approach is provided in the CausalGPS R package, which is available on CRAN and provides a computationally efficient implementation. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 757-772 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2144737 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2144737 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:757-772 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2142592_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Wanjun Liu Author-X-Name-First: Wanjun Author-X-Name-Last: Liu Author-Name: Xiufan Yu Author-X-Name-First: Xiufan Author-X-Name-Last: Yu Author-Name: Wei Zhong Author-X-Name-First: Wei Author-X-Name-Last: Zhong Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Title: Projection Test for Mean Vector in High Dimensions Abstract: This article studies the projection test for high-dimensional mean vectors via optimal projection. The idea of projection test is to project high-dimensional data onto a space of low dimension such that traditional methods can be applied. We first propose a new estimation for the optimal projection direction by solving a constrained and regularized quadratic programming. Then two tests are constructed using the estimated optimal projection direction. The first one is based on a data-splitting procedure, which achieves an exact t-test under the normality assumption. To mitigate the power loss due to data-splitting, we further propose an online framework, which iteratively updates the estimation of projection direction when new observations arrive. We show that this online-style projection test asymptotically converges to the standard normal distribution. Various simulation studies as well as a real data example show that the proposed online-style projection test controls the Type I error rate well and is more powerful than other existing tests. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 744-756 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2142592 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2142592 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:744-756 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2131557_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Elizabeth L. Ogburn Author-X-Name-First: Elizabeth L. Author-X-Name-Last: Ogburn Author-Name: Oleg Sofrygin Author-X-Name-First: Oleg Author-X-Name-Last: Sofrygin Author-Name: Iván Díaz Author-X-Name-First: Iván Author-X-Name-Last: Díaz Author-Name: Mark J. van der Laan Author-X-Name-First: Mark J. Author-X-Name-Last: van der Laan Title: Causal Inference for Social Network Data Abstract: We describe semiparametric estimation and inference for causal effects using observational data from a single social network. Our asymptotic results are the first to allow for dependence of each observation on a growing number of other units as sample size increases.
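The data-splitting variant in the projection test abstract above admits a compact sketch: estimate a projection direction on one half of the sample, project the other half, and run a one-sample t-test. The ridge-regularized direction below is a stand-in assumption for the constrained quadratic program the authors actually solve.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def projection_test(X, lam=0.5):
    """Data-splitting projection test of H0: E[X] = 0."""
    n, p = X.shape
    X1, X2 = X[: n // 2], X[n // 2:]
    mu1 = X1.mean(0)
    S1 = np.cov(X1, rowvar=False)
    direction = np.linalg.solve(S1 + lam * np.eye(p), mu1)
    # Under H0 the held-out projections are mean zero, so the t-test
    # is exact under normality, whatever direction the first half gave.
    z = X2 @ direction
    return stats.ttest_1samp(z, popmean=0.0)

X = rng.normal(size=(200, 50)) + 0.15        # dense mean shift
print(projection_test(X))
```

The online variant in the paper avoids the power loss of discarding half the sample by updating the direction as new observations arrive.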
In addition, while previous methods have implicitly permitted only one of two possible sources of dependence among social network observations, we allow for both dependence due to transmission of information across network ties and dependence due to latent similarities among nodes sharing ties. We propose new causal effects that are specifically of interest in social network settings, such as interventions on network ties and network structure. We use our methods to reanalyze an influential and controversial study that estimated causal peer effects of obesity using social network data from the Framingham Heart Study; after accounting for network structure we find no evidence for causal peer effects. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 597-611 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2131557 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2131557 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:597-611 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2105704_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Seyoung Park Author-X-Name-First: Seyoung Author-X-Name-Last: Park Author-Name: Eun Ryung Lee Author-X-Name-First: Eun Ryung Author-X-Name-Last: Lee Author-Name: Hongyu Zhao Author-X-Name-First: Hongyu Author-X-Name-Last: Zhao Title: Low-Rank Regression Models for Multiple Binary Responses and their Applications to Cancer Cell-Line Encyclopedia Data Abstract: In this article, we study high-dimensional multivariate logistic regression models in which a common set of covariates is used to predict multiple binary outcomes simultaneously. Our work is primarily motivated by many biomedical studies with multiple correlated responses, such as the cancer cell-line encyclopedia project. We assume that the underlying regression coefficient matrix is simultaneously low-rank and row-wise sparse. We propose an intuitively appealing selection and estimation framework based on marginal model likelihood, and we develop an efficient computational algorithm for inference. We establish a novel high-dimensional theory for this nonlinear multivariate regression. Our theory is general, allowing for potential correlations between the binary responses. We propose a new type of nuclear norm penalty using the smooth clipped absolute deviation, filling the gap in the related non-convex penalization literature. We theoretically demonstrate that the proposed approach improves estimation accuracy by considering multiple responses jointly through the proposed estimator when the underlying coefficient matrix is low-rank and row-wise sparse. In particular, we establish the non-asymptotic error bounds, and both rank and row support consistency of the proposed method. Moreover, we develop a consistent rule to simultaneously select the rank and row dimension of the coefficient matrix. Furthermore, we extend the proposed methods and theory to a joint Ising model, which accounts for the dependence relationships. In our analysis of both simulated data and the cancer cell line encyclopedia data, the proposed methods outperform existing methods in predicting responses. Supplementary materials for this article are available online.
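To fix ideas for the low-rank multivariate logistic model above, here is a proximal-gradient sketch with a plain (convex) nuclear-norm penalty on hypothetical toy data; the paper's estimator instead uses a SCAD-type nuclear penalty plus row-wise sparsity, so this is a simplified stand-in.

```python
import numpy as np

rng = np.random.default_rng(8)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def svt(B, tau):
    """Singular value soft-thresholding: prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def lowrank_logistic(X, Y, lam=0.02, step=0.5, iters=300):
    n, p = X.shape
    B = np.zeros((p, Y.shape[1]))
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ B) - Y) / n    # logistic loss gradient
        B = svt(B - step * grad, step * lam)     # proximal step
    return B

# Toy data: a rank-1 coefficient matrix shared across q responses.
n, p, q = 500, 20, 5
X = rng.normal(size=(n, p))
B_true = np.outer(rng.normal(size=p) / np.sqrt(p), rng.normal(size=q))
Y = (rng.random(size=(n, q)) < sigmoid(X @ B_true)).astype(float)
B_hat = lowrank_logistic(X, Y)
print("top singular values of B_hat:",
      np.round(np.linalg.svd(B_hat, compute_uv=False)[:4], 3))
```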
Journal: Journal of the American Statistical Association Pages: 202-216 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2105704 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2105704 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:202-216 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2126361_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Ivo V. Stoepker Author-X-Name-First: Ivo V. Author-X-Name-Last: Stoepker Author-Name: Rui M. Castro Author-X-Name-First: Rui M. Author-X-Name-Last: Castro Author-Name: Ery Arias-Castro Author-X-Name-First: Ery Author-X-Name-Last: Arias-Castro Author-Name: Edwin van den Heuvel Author-X-Name-First: Edwin Author-X-Name-Last: van den Heuvel Title: Anomaly Detection for a Large Number of Streams: A Permutation-Based Higher Criticism Approach Abstract: Anomaly detection when observing a large number of data streams is essential in a variety of applications, ranging from epidemiological studies to monitoring of complex systems. High-dimensional scenarios are usually tackled with scan-statistics and related methods, requiring stringent modeling assumptions for proper calibration. In this work we take a nonparametric stance, and propose a permutation-based variant of the higher criticism statistic that does not require knowledge of the null distribution. This results in an exact test in finite samples that is asymptotically optimal in the wide class of exponential models. We demonstrate that the power loss in finite samples is minimal relative to the oracle test. Furthermore, since the proposed statistic does not rely on asymptotic approximations, it typically performs better than popular variants of higher criticism that rely on such approximations. We include recommendations such that the test can be readily applied in practice, and demonstrate its applicability in monitoring the content uniformity of an active ingredient for a batch-produced drug product. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 461-474 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2126361 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2126361 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:461-474 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2129059_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Shuang Zhou Author-X-Name-First: Shuang Author-X-Name-Last: Zhou Author-Name: Pallavi Ray Author-X-Name-First: Pallavi Author-X-Name-Last: Ray Author-Name: Debdeep Pati Author-X-Name-First: Debdeep Author-X-Name-Last: Pati Author-Name: Anirban Bhattacharya Author-X-Name-First: Anirban Author-X-Name-Last: Bhattacharya Title: A Mass-Shifting Phenomenon of Truncated Multivariate Normal Priors Abstract: We show that lower-dimensional marginal densities of dependent zero-mean normal distributions truncated to the positive orthant exhibit a mass-shifting phenomenon. Despite the truncated multivariate normal density having a mode at the origin, the marginal density assigns increasingly small mass near the origin as the dimension increases. The phenomenon accentuates with stronger correlation between the random variables.
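The higher criticism statistic at the heart of the Stoepker et al. abstract is short to write down. In the sketch below, per-stream p-values are simulated directly, and the null distribution is calibrated by Monte Carlo draws of uniform p-values; this shortcut plays the role of, but is not, the permutation recalibration developed in the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)

def higher_criticism(pvals):
    """HC statistic over the smaller half of the sorted p-values."""
    n = len(pvals)
    p = np.sort(pvals)
    k = np.arange(1, n + 1)
    hc = np.sqrt(n) * (k / n - p) / np.sqrt(p * (1 - p))
    return hc[: n // 2].max()

# 1000 streams, 20 of which carry a sparse, weak signal.
n_streams, n_signal = 1000, 20
z = rng.normal(size=n_streams)
z[:n_signal] += 3.0
pvals = stats.norm.sf(z)

obs = higher_criticism(pvals)
null = np.array([higher_criticism(rng.random(n_streams)) for _ in range(999)])
print("calibrated p-value:", (1 + (null >= obs).sum()) / 1000)
```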
This surprising behavior has serious implications toward Bayesian constrained estimation and inference, where the prior, in addition to having full support, is required to assign substantial probability near the origin to capture flat parts of the true function of interest. A precise quantification of the mass-shifting phenomenon for both the prior and the posterior, characterizing the role of the dimension as well as the dependence, is provided under a variety of correlation structures. We show that, without further modification, truncated normal priors are not suitable for modeling flat regions, and we propose a novel alternative strategy based on shrinking the coordinates using a multiplicative scale parameter. The proposed shrinkage prior is shown to achieve optimal posterior contraction around true functions with potentially flat regions. Synthetic and real data studies demonstrate how the modification guards against the mass-shifting phenomenon while retaining computational efficiency. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 582-596 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2129059 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2129059 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:582-596 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2238943_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Patrick M. Schnell Author-X-Name-First: Patrick M. Author-X-Name-Last: Schnell Author-Name: Matthew Wascher Author-X-Name-First: Matthew Author-X-Name-Last: Wascher Author-Name: Grzegorz A. Rempala Author-X-Name-First: Grzegorz A. Author-X-Name-Last: Rempala Title: Overcoming Repeated Testing Schedule Bias in Estimates of Disease Prevalence Abstract: During the COVID-19 pandemic, many institutions such as universities and workplaces implemented testing regimens with every member of some population tested longitudinally, and those testing positive isolated for some time. Although the primary purpose of such regimens was to suppress disease spread by identifying and isolating infectious individuals, testing results were often also used to obtain prevalence and incidence estimates. Such estimates are helpful in risk assessment and institutional planning, and various estimation procedures have been implemented, ranging from simple test-positive rates to complex dynamical modeling. Unfortunately, the popular test-positive rate is a biased estimator of prevalence under many seemingly innocuous longitudinal testing regimens with isolation. We illustrate how such bias arises and identify conditions under which the test-positive rate is unbiased. Further, we identify weaker conditions under which prevalence is identifiable and propose a new estimator of prevalence under longitudinal testing. We evaluate the proposed estimation procedure via simulation study and illustrate its use on a dataset derived by anonymizing testing data from The Ohio State University. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 1-13 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2023.2238943 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2238943 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
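The mass-shifting phenomenon above can be seen directly by simulation: sample a zero-mean equicorrelated normal vector restricted to the positive orthant and watch the first coordinate's mass near the origin vanish as the dimension grows. Rejection sampling and the ρ = 0.5 equicorrelation are illustrative choices that keep the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(10)

def first_coord_truncated(d, rho=0.5, n_keep=20000):
    """Rejection-sample N(0, Sigma) truncated to the positive orthant
    (Sigma equicorrelated) and return the first coordinate."""
    L = np.linalg.cholesky(rho * np.ones((d, d)) + (1 - rho) * np.eye(d))
    out = []
    while len(out) < n_keep:
        z = rng.normal(size=(50000, d)) @ L.T
        out.extend(z[(z > 0).all(axis=1), 0])
    return np.array(out[:n_keep])

for d in (2, 5, 10):
    x1 = first_coord_truncated(d)
    print(f"d = {d:2d}: P(X1 < 0.1 | positive orthant) ~ {(x1 < 0.1).mean():.3f}")
```

Despite the joint mode sitting at the origin, the printed probabilities decay with d, which is exactly the behavior that undermines truncated normal priors near flat regions.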
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:1-13 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2108816_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Yi Chen Author-X-Name-First: Yi Author-X-Name-Last: Chen Author-Name: Yining Wang Author-X-Name-First: Yining Author-X-Name-Last: Wang Author-Name: Ethan X. Fang Author-X-Name-First: Ethan X. Author-X-Name-Last: Fang Author-Name: Zhaoran Wang Author-X-Name-First: Zhaoran Author-X-Name-Last: Wang Author-Name: Runze Li Author-X-Name-First: Runze Author-X-Name-Last: Li Title: Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection Abstract: We consider the stochastic contextual bandit problem under the high dimensional linear model. We focus on the case where the action space is finite and random, with each action associated with a randomly generated contextual covariate. This setting finds essential applications in personalized recommendations, online advertisements, and personalized medicine. However, it is very challenging to balance the exploration-exploitation tradeoff. We modify the LinUCB algorithm in doubly growing epochs and estimate the parameter using the best subset selection method, which is easy to implement in practice. This approach achieves O(s√T) regret with high probability, which is nearly independent of the “ambient” regression model dimension d. We further attain a sharper O(√(sT)) regret by using the SupLinUCB framework and match the minimax lower bound of the low-dimensional linear stochastic bandit problem. Finally, we conduct extensive numerical experiments to empirically demonstrate our algorithms’ applicability and robustness. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 246-258 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2108816 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2108816 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:246-258 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2106868_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Chengchun Shi Author-X-Name-First: Chengchun Author-X-Name-Last: Shi Author-Name: Shikai Luo Author-X-Name-First: Shikai Author-X-Name-Last: Luo Author-Name: Yuan Le Author-X-Name-First: Yuan Author-X-Name-Last: Le Author-Name: Hongtu Zhu Author-X-Name-First: Hongtu Author-X-Name-Last: Zhu Author-Name: Rui Song Author-X-Name-First: Rui Author-X-Name-Last: Song Title: Statistically Efficient Advantage Learning for Offline Reinforcement Learning in Infinite Horizons Abstract: We consider reinforcement learning (RL) methods in offline domains without additional online data collection, such as mobile health applications. Most existing policy optimization algorithms in the computer science literature are developed in online settings where data are easy to collect or simulate. Their generalizations to mobile health applications with a pre-collected offline dataset remain less explored. The aim of this article is to develop a novel advantage learning framework in order to efficiently use pre-collected data for policy optimization.
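A stripped-down version of the sparse linear bandit in the Chen et al. abstract: LinUCB with a ridge estimate that is hard-thresholded as a crude stand-in for the best subset selection step. The epoch schedule and the SupLinUCB refinement of the paper are omitted, and all constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(11)

def sparse_linucb(T=1000, d=50, K=20, s=3, alpha=1.0, thresh=0.1):
    theta = np.zeros(d)
    theta[:s] = 1.0                                  # true s-sparse parameter
    A, b, regret = np.eye(d), np.zeros(d), 0.0
    for t in range(T):
        X = rng.normal(size=(K, d)) / np.sqrt(d)     # fresh random action set
        theta_hat = np.linalg.solve(A, b)
        theta_hat[np.abs(theta_hat) < thresh] = 0.0  # crude sparsification
        A_inv = np.linalg.inv(A)
        width = np.sqrt(np.einsum('kd,de,ke->k', X, A_inv, X))
        a = int(np.argmax(X @ theta_hat + alpha * width))
        r = X[a] @ theta + 0.1 * rng.normal()
        regret += (X @ theta).max() - X[a] @ theta
        A += np.outer(X[a], X[a])
        b += r * X[a]
    return regret

print("cumulative regret:", round(sparse_linucb(), 2))
```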
The proposed method takes an optimal Q-estimator computed by any existing state-of-the-art RL algorithm as input and outputs a new policy whose value is guaranteed to converge at a faster rate than that of the policy derived from the initial Q-estimator. Extensive numerical experiments are conducted to back up our theoretical findings. A Python implementation of our proposed method is available at https://github.com/leyuanheart/SEAL. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 232-245 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2106868 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2106868 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:232-245 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2123335_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Qian Xiao Author-X-Name-First: Qian Author-X-Name-Last: Xiao Author-Name: Yaping Wang Author-X-Name-First: Yaping Author-X-Name-Last: Wang Author-Name: Abhyuday Mandal Author-X-Name-First: Abhyuday Author-X-Name-Last: Mandal Author-Name: Xinwei Deng Author-X-Name-First: Xinwei Author-X-Name-Last: Deng Title: Modeling and Active Learning for Experiments with Quantitative-Sequence Factors Abstract: A new type of experiment that aims to determine the optimal quantities of a sequence of factors is eliciting considerable attention in medical science, bioengineering, and many other disciplines. Such studies require the simultaneous optimization of both quantities and sequence orders of several components, which are called quantitative-sequence (QS) factors. Given the large and semi-discrete solution spaces in such experiments, efficiently identifying optimal or near-optimal solutions by using a small number of experimental trials is a nontrivial task. To address this challenge, we propose a novel active learning approach, called QS-learning, to enable effective modeling and efficient optimization for experiments with QS factors. QS-learning consists of three parts: a novel mapping-based additive Gaussian process (MaGP) model, an efficient global optimization scheme (QS-EGO), and a new class of optimal designs (QS-design). The theoretical properties of the proposed method are investigated, and optimization techniques using analytical gradients are developed. The performance of the proposed method is demonstrated via a real drug experiment on lymphoma treatment and several simulation studies. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 407-421 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2123335 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2123335 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:407-421 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2278201_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Anna Menacher Author-X-Name-First: Anna Author-X-Name-Last: Menacher Author-Name: Thomas E. Nichols Author-X-Name-First: Thomas E.
Author-X-Name-Last: Nichols Author-Name: Chris Holmes Author-X-Name-First: Chris Author-X-Name-Last: Holmes Author-Name: Habib Ganjgahi Author-X-Name-First: Habib Author-X-Name-Last: Ganjgahi Title: Bayesian Lesion Estimation with a Structured Spike-and-Slab Prior Abstract: Neural demyelination and brain damage accumulated in white matter appear as hyperintense areas on T2-weighted MRI scans in the form of lesions. Modeling binary images at the population level, where each voxel represents the existence of a lesion, plays an important role in understanding aging and inflammatory diseases. We propose a scalable hierarchical Bayesian spatial model, called BLESS, capable of handling binary responses by placing continuous spike-and-slab mixture priors on spatially varying parameters and enforcing spatial dependency on the parameter dictating the amount of sparsity within the probability of inclusion. The use of mean-field variational inference with dynamic posterior exploration, which is an annealing-like strategy that improves optimization, allows our method to scale to large sample sizes. Our method also accounts for underestimation of posterior variance due to variational inference by providing an approximate posterior sampling approach based on Bayesian bootstrap ideas and spike-and-slab priors with random shrinkage targets. Besides accurate uncertainty quantification, this approach is capable of producing novel cluster-size-based imaging statistics, such as credible intervals of cluster size, and measures of reliability of cluster occurrence. Lastly, we validate our results via simulation studies and an application to the UK Biobank, a large-scale lesion mapping study with a sample size of 40,000 subjects. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 66-80 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2023.2278201 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2278201 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:66-80 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2270657_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Lijia Wang Author-X-Name-First: Lijia Author-X-Name-Last: Wang Author-Name: Y. X. Rachel Wang Author-X-Name-First: Y. X. Rachel Author-X-Name-Last: Wang Author-Name: Jingyi Jessica Li Author-X-Name-First: Jingyi Jessica Author-X-Name-Last: Li Author-Name: Xin Tong Author-X-Name-First: Xin Author-X-Name-Last: Tong Title: Hierarchical Neyman-Pearson Classification for Prioritizing Severe Disease Categories in COVID-19 Patient Data Abstract: COVID-19 has a spectrum of disease severity, ranging from asymptomatic to requiring hospitalization. Understanding the mechanisms driving disease severity is crucial for developing effective treatments and reducing mortality rates. One way to gain such understanding is to use a multi-class classification framework, in which patients’ biological features are used to predict patients’ severity classes. In this severity classification problem, it is beneficial to prioritize the identification of more severe classes and control the “under-classification” errors, in which patients are misclassified into less severe categories. The Neyman-Pearson (NP) classification paradigm has been developed to prioritize the designated type of error.
However, current NP procedures are either for binary classification or do not provide high-probability control of the prioritized errors in multi-class classification. Here, we propose a hierarchical NP (H-NP) framework and an umbrella algorithm that generally adapts to popular classification methods and controls the under-classification errors with high probability. On an integrated collection of single-cell RNA-seq (scRNA-seq) datasets for 864 patients, we explore ways of featurization and demonstrate the efficacy of the H-NP algorithm in controlling the under-classification errors regardless of featurization. Beyond COVID-19 severity classification, the H-NP algorithm generally applies to multi-class classification problems, where classes have a priority order. Supplementary materials for this article are available online. Journal: Journal of the American Statistical Association Pages: 39-51 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2023.2270657 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2270657 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:39-51 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2115917_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Xiaoyu Hu Author-X-Name-First: Xiaoyu Author-X-Name-Last: Hu Author-Name: Fang Yao Author-X-Name-First: Fang Author-X-Name-Last: Yao Title: Dynamic Principal Component Analysis in High Dimensions Abstract: Principal component analysis is a versatile tool to reduce dimensionality which has wide applications in statistics and machine learning. It is particularly useful for modeling data in high-dimensional scenarios where the number of variables p is comparable to, or much larger than, the sample size n. Despite an extensive literature on this topic, researchers have focused on modeling static principal eigenvectors, which are not suitable for stochastic processes that are dynamic in nature. To characterize the change in the entire course of high-dimensional data collection, we propose a unified framework to directly estimate dynamic eigenvectors of covariance matrices. Specifically, we formulate an optimization problem by combining the local linear smoothing and regularization penalty together with the orthogonality constraint, which can be effectively solved by manifold optimization algorithms. We show that our method is suitable for high-dimensional data observed under both common and irregular designs, and theoretical properties of the estimators are investigated under ℓq (0 ≤ q ≤ 1) sparsity. Extensive experiments demonstrate the effectiveness of the proposed method in both simulated and real data examples. Journal: Journal of the American Statistical Association Pages: 308-319 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2022.2115917 File-URL: http://hdl.handle.net/10.1080/01621459.2022.2115917 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers.
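The dynamic eigenvector idea in the Hu and Yao abstract above reduces, in its simplest form, to eigendecomposing a kernel-smoothed covariance at each time point. The sketch below does exactly that on a toy process whose leading direction rotates smoothly; the Gaussian kernel and bandwidth are arbitrary, and the paper's ℓq penalty and manifold optimization are omitted.

```python
import numpy as np

rng = np.random.default_rng(12)

def dynamic_eigvec(X, times, t0, h=0.1):
    """Leading eigenvector of a kernel-weighted covariance at time t0."""
    wts = np.exp(-0.5 * ((times - t0) / h) ** 2)
    wts /= wts.sum()
    Xc = X - (wts[:, None] * X).sum(0)
    S = (wts[:, None] * Xc).T @ Xc
    _, vecs = np.linalg.eigh(S)
    return vecs[:, -1]

# Toy process: the top principal direction rotates with time.
n, p = 1000, 20
times = np.linspace(0.0, 1.0, n)
X = rng.normal(size=(n, p))
for i, t in enumerate(times):
    v = np.zeros(p)
    v[0], v[1] = np.cos(t), np.sin(t)
    X[i] += 3.0 * rng.normal() * v

v_true = np.zeros(p)
v_true[0], v_true[1] = np.cos(0.5), np.sin(0.5)
v_hat = dynamic_eigvec(X, times, t0=0.5)
print("alignment at t = 0.5:", abs(v_hat @ v_true))
```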
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:308-319 Template-Type: ReDIF-Article 1.0 # input file: UASA_A_2273403_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a Author-Name: Ali Rahnavard Author-X-Name-First: Ali Author-X-Name-Last: Rahnavard Title: Statistical Analytics for Health Data Science with SAS and R Journal: Journal of the American Statistical Association Pages: 786-787 Issue: 545 Volume: 119 Year: 2024 Month: 1 X-DOI: 10.1080/01621459.2023.2273403 File-URL: http://hdl.handle.net/10.1080/01621459.2023.2273403 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:786-787